Commit 3e69b85
* feat: add SDKDispatcher and --agent sdk flag (#121)
Replace the subprocess(claude -p) transport with the Claude Agent SDK
behind a new --agent sdk flag. CLIDispatcher remains the default; sdk
mode is opt-in until soak time validates parity.
Why: claude -p is blind for up to 7200s, has no native streaming, no
programmatic prompt caching, no native subagent spawning, and retries by
subprocess restart (which loses message context). The SDK addresses
each of these.
What lands:
- orchestrator/sdk_dispatch.py: SDKDispatcher extends CLIDispatcher,
overrides only _call_claude and preflight_check. Reuses the parse /
validate / retry-with-feedback machinery for fenced-output phases.
- A pluggable sdk_runner Protocol (SDKResult dataclass) is the seam
for behavioral tests and for #122/#127 follow-ups (cache_control,
stream-json) that need to read SDK events.
- Default runner lazily resolves to the real claude_agent_sdk so
environments without the SDK installed don't fail at import time.
- CLI/argparse choices extended to ["inline", "api", "sdk"] in cli.py,
campaign.py, iteration.py (parser declarations and dispatch routing).
- Pre-flight check in campaign.py routes to SDK preflight when sdk mode.
- pyproject.toml gains an [sdk] optional extra: claude-agent-sdk + anyio.
- docs/architecture.md describes the new path.
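Illustration of the runner seam described above. This is a minimal
sketch, not the code in orchestrator/sdk_dispatch.py: the SDKRunner
Protocol shape, the SDKResult fields, and the helper names are all
assumptions. It shows why deferring the claude_agent_sdk import keeps
module import safe when the [sdk] extra is not installed.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class SDKResult:
    # Field names are illustrative, not the real dataclass.
    text: str
    is_error: bool = False


class SDKRunner(Protocol):
    def __call__(self, prompt: str) -> SDKResult: ...


def default_runner(prompt: str) -> SDKResult:
    # Import lazily so environments without the SDK fail only when
    # sdk mode is actually used, not at module import time.
    try:
        import claude_agent_sdk  # noqa: F401
    except ImportError as exc:
        raise RuntimeError("install the [sdk] extra to use --agent sdk") from exc
    raise NotImplementedError("real SDK call omitted from this sketch")


def call_claude(prompt: str, runner: Optional[SDKRunner] = None) -> SDKResult:
    # Tests pass a fake runner; production falls back to the lazy default.
    return (runner or default_runner)(prompt)


def fake_runner(prompt: str) -> SDKResult:
    # Hypothetical test double standing in for the SDK.
    return SDKResult(text=f"echo: {prompt}")
```

A behavioral test would pass fake_runner and assert only on what comes
back (or on the artifacts written), never on how the runner was called.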
Behavioral tests (tests/test_sdk_dispatch.py): 6 cases covering text
phase output, structured phase parse+validate, transient retry,
retry exhaustion, and is_error -> retry. All assertions are about
on-disk artifacts and metrics rows; none assert call shape, argv,
or which method was invoked on the runner.
Out of scope for this PR (queued in #120 plan):
- Prompt caching (#122).
- Stream-json TUI (#127).
- Removing claude -p (post-soak cleanup).
Test suite: 338 passed (existing) + 6 new = 344.
Closes #121.
Refs #120.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: add deterministic Stop hook for executor completion (#129)
Ship bin/nous-execute-stop, a Python entrypoint suitable for use as a
Claude Code Stop hook. It tells the harness whether the executor agent
is allowed to terminate, based on objective evidence on disk:
* exit 0 (allow stop) when:
  - principle_updates.json exists in $NOUS_ITER_DIR
  - `nous validate execution --dir $NOUS_ITER_DIR` returns pass
* exit 2 (block stop) otherwise, with a structured reason on stderr
  so Claude Code feeds it back into the agent's conversation and the
  next turn fixes the artifact rather than restarting.
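Illustration of the exit-code contract above. A minimal sketch, assuming
a hypothetical decide_stop helper and an injected validate callable in
place of the real `nous validate execution` shell-out; the reason
strings are made up for the example.

```python
from pathlib import Path
from typing import Callable, Mapping, Tuple

ALLOW_STOP, BLOCK_STOP = 0, 2


def decide_stop(env: Mapping[str, str],
                validate: Callable[[Path], bool]) -> Tuple[int, str]:
    """Return (exit_code, reason); reason is what would go to stderr."""
    iter_dir_raw = env.get("NOUS_ITER_DIR")
    if not iter_dir_raw:
        return BLOCK_STOP, "config error: NOUS_ITER_DIR is unset"
    iter_dir = Path(iter_dir_raw)
    if not iter_dir.is_dir():
        return BLOCK_STOP, f"{iter_dir} does not exist"
    if not (iter_dir / "principle_updates.json").is_file():
        return BLOCK_STOP, "principle_updates.json is missing"
    if not validate(iter_dir):
        return BLOCK_STOP, "nous validate execution did not pass"
    return ALLOW_STOP, ""
```

Keeping the decision in a pure function leaves the entrypoint with one
job: print the reason to stderr and exit with the returned code.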
Why deterministic over probabilistic: the existing /goal evaluator (Haiku
post-turn) is right for fuzzy success criteria, but execution completion
is a schema check. A deterministic shell-out is cheaper, faster, and
immune to evaluator drift. The two coexist; #124 wires /goal for
fuzzy gating, this hook handles the schema gate.
Wire-up: the orchestrator exports NOUS_ITER_DIR before launching the
executor session, and the per-campaign .claude/settings.json (which
lands in #135) registers this script under hooks.Stop. This PR ships
just the script so it can be installed manually today.
Behavioral tests (5):
* pass case: valid iter dir + principle_updates.json -> exit 0, no stderr
* block: principle_updates.json missing -> exit 2, stderr names the file
* block: corrupted findings.json -> exit 2, stderr includes the schema diff
* block: NOUS_ITER_DIR points at non-existent dir -> exit 2 with reason
* block: NOUS_ITER_DIR unset -> exit 2 with config-error reason
Tests use StubDispatcher to populate a known-passing iter dir, then
mutate it to simulate failure modes. Assertions describe what the hook
emits (exit code + stderr substrings) — never which functions it called.
Test suite: 338 baseline + 5 new = 343 passing.
Closes #129.
Refs #120.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* security: per-campaign permission policy via .claude/settings.json (#135)
Replace --dangerously-skip-permissions with a fine-grained, per-campaign
permission policy generated at init.
The orchestrator's pure renderer (orchestrator/settings_template.py) takes
work_dir, repo_path, and an optional experiment_plan, and returns a dict
suitable for serialization as .claude/settings.json. The contents:
- permissions.allowOnly: campaign work-dir and target repo path. Anything
else is denied by default.
- permissions.allow: Bash command allowlist — conservative defaults plus
any binaries pulled out of experiment_plan.yaml arm conditions, plus
caller-provided extras.
- permissions.deny: hard blocks for outbound https (curl/wget) and
catastrophic shell commands (rm -rf /).
- hooks.Stop: registered when bin/nous-execute-stop is present (#129
integration).
- hooks.PreToolUse: registered when caller provides the path (#128 hook).
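Illustration of the rendered shape listed above. A minimal sketch, not
orchestrator/settings_template.py: the function name, its parameters,
the Bash rule-string format, and the simplified hook entry are all
assumptions. Only the key names (permissions.allowOnly, allow, deny,
hooks.Stop) come from the description above.

```python
import os
from typing import Dict, Iterable, Optional

DEFAULT_BINS = ("python", "git", "grep")


def render_settings(work_dir: str,
                    repo_path: Optional[str] = None,
                    plan_bins: Iterable[str] = (),
                    stop_hook_path: Optional[str] = None) -> Dict:
    allow_only = [work_dir] + ([repo_path] if repo_path else [])
    # Plan binaries such as ./blis are matched by basename.
    bins = sorted(set(DEFAULT_BINS) | {os.path.basename(b) for b in plan_bins})
    settings: Dict = {
        "permissions": {
            "allowOnly": allow_only,
            # The rule-string format here is an assumption of this sketch.
            "allow": [f"Bash({b}:*)" for b in bins],
            "deny": ["Bash(curl:*)", "Bash(wget:*)", "Bash(rm -rf /)"],
        }
    }
    if stop_hook_path:
        # Simplified hook entry; the hooks key is omitted when no path is given.
        settings["hooks"] = {"Stop": [os.path.abspath(stop_hook_path)]}
    return settings
```

Because the renderer is pure, the replacement invariant (non-empty
allowOnly and non-empty deny) can be asserted on the returned dict
without touching disk.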
setup_work_dir() now writes the rendered settings file at init time,
idempotently (won't clobber a hand-edited file). CLIDispatcher
auto-detects work_dir/.claude/settings.json on construction, and when
present passes --settings <path> to claude -p instead of
--dangerously-skip-permissions. SDKDispatcher already accepted
settings_path in #121 — wire-up matches.
Behavioral tests (tests/test_settings_template.py): 14 cases.
Renderer contract:
- allowOnly contains work_dir
- allowOnly contains repo_path when provided
- default bin allowlist contains python, git, grep
- plan binaries (./blis, /usr/local/bin/sim) are added by basename
- extra_bin_allowlist extends defaults
- deny blocks outbound https
- hooks section absent unless hook paths provided
- Stop hook registered with absolute path
- PreToolUse hook registered with Bash matcher
Disk write contract:
- write_campaign_settings creates parent dir + writes JSON
- settings_path_for returns .claude/settings.json under work_dir
Init wiring contract:
- setup_work_dir writes the file when fresh
- setup_work_dir does NOT overwrite a user-customized settings file
Replacement invariant (the security property):
- rendered settings impose non-empty allowOnly AND non-empty deny
(otherwise the file is functionally equivalent to --dangerously
and the swap is a regression).
Out of scope: the "out-of-worktree write is denied" criterion is an
integration test against a live claude session and is verified manually.
docs/security.md describes the model end-to-end.
Test suite: 338 baseline + 14 new = 352 passing.
Closes #135.
Refs #120.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: PreToolUse plan-enforcer hook (#128)
Ship bin/nous-plan-enforcer, a Python entrypoint for use as a Claude Code
PreToolUse hook. It intercepts proposed Bash tool calls during the
executor session and decides whether to allow them based on the
iteration's experiment_plan.yaml.
Decision protocol:
* NOUS_PLAN_ENFORCEMENT=strict: exit 2 (block) if the proposed
  command's head binary is not the head binary of any planned
  condition. Stderr explains the violation; the agent reads it and
  is expected to either revise the command or annotate
  "# nous: ad-hoc" to opt out for one call.
* NOUS_PLAN_ENFORCEMENT=warn (default): always exit 0 (allow), but
  record violations to <iter_dir>/plan_violations.jsonl with
  timestamp, kind, command, and best-effort arm attribution.
* Escape hatch: a command containing the literal "# nous: ad-hoc"
  is allowed in BOTH modes and logged as kind:"ad-hoc" so reviewers
  can audit how often it's used.
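Illustration of the decision protocol above. A minimal sketch, assuming
hypothetical helper names and assuming head binaries are compared by
basename; the real hook's parsing may differ. The second return value
is the kind a caller would record, with None meaning nothing to record.

```python
import os
import shlex
from typing import Iterable, Optional, Tuple

AD_HOC_MARKER = "# nous: ad-hoc"


def head_binary(command: str) -> str:
    # First shell token, reduced to its basename (./blis -> blis).
    tokens = shlex.split(command, comments=True)
    return os.path.basename(tokens[0]) if tokens else ""


def decide(command: str, planned: Iterable[str],
           mode: str = "warn") -> Tuple[int, Optional[str]]:
    """Return (exit_code, kind) for a proposed Bash command."""
    if AD_HOC_MARKER in command:
        return 0, "ad-hoc"        # escape hatch: allowed in both modes
    planned_heads = {head_binary(c) for c in planned}
    if head_binary(command) in planned_heads:
        return 0, None            # planned command: allow, nothing to record
    if mode == "strict":
        return 2, "violation"     # block; caller explains on stderr
    return 0, "violation"         # warn: allow, caller records the violation
```

Matching on the head binary only means different arguments to a planned
binary still pass, which is what the strict-mode tests below rely on.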
Why this exists: the 5/18 mech-design-enforcement run showed two
executor processes racing on the same iter dir, partly because nothing
inside the agent enforced the plan. Hooks intercept tool calls
deterministically before they execute, which adds defense in depth on
top of #135's permission policy.
Wire-up: setup_work_dir registers the hook automatically when
bin/nous-plan-enforcer exists, alongside the Stop hook from #129. The
.claude/settings.json template (#135) already supports
pre_tool_use_hook_path; this PR connects the wire.
Behavioral tests (8 in tests/test_plan_enforcer_hook.py):
Strict mode:
- allows a planned binary's command (different args still match by head)
- blocks an unplanned binary with stderr naming the violation
- allows ad-hoc-marked commands AND logs them distinctly
Warn mode:
- allows unplanned and logs to plan_violations.jsonl
- does NOT log planned commands
No false positives: parametric over four representative plan shapes
(single-arm/condition; multi-condition; multi-arm; absolute path) —
every planned command is allowed in strict mode.
Edge cases:
- missing NOUS_ITER_DIR: fail open (cannot enforce what we can't
compare against)
- non-Bash tool calls (Read, Write, etc.): pass through, no log
Stacked on #135 (security/135-permission-policy). Rebase onto
reflective once that lands.
Test suite: 352 (post-#135) + 8 new = 360 passing.
Closes #128.
Refs #120.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor: per-campaign CLAUDE.md generated at init + regenerated each iter (#131)
Phase A of #131: wire the deterministic CLAUDE.md pipeline. Phase B
(refactoring the prompt templates to omit methodology when CLAUDE.md is
in scope, which is the actual token-shrink win) is queued as a follow-up.
What lands here:
* orchestrator/claude_md.py: pure renderer + disk writer.
render_campaign_claude_md(campaign, principles, last_handoff,
iteration) returns the full markdown text. Sections: Research
Question, Target System (name/description/metrics/knobs), Active
Principles (filtered to status=="active"), Most Recent Handoff.
Header carries an explicit "auto-generated; do not hand-edit"
notice so reviewers don't accidentally orphan their changes.
* regenerate_from_disk(work_dir, campaign, iteration) reads
principles.json + handoff.md from work_dir and writes a fresh
CLAUDE.md. Pure Python, never an LLM call.
* orchestrator/campaign.py: writes initial CLAUDE.md after
setup_work_dir so iter 1's session starts with the campaign brief
in scope.
* orchestrator/iteration.py: regenerates CLAUDE.md after every
_merge_principles, so iter N+1 sees the principles produced by
iter N. Best-effort — a write failure logs at warning and does NOT
abort the iteration.
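Illustration of the pure renderer described above. A minimal sketch,
not orchestrator/claude_md.py: the function name, the dict keys, and
the exact headings are assumptions, and the Target System section is
left out for brevity.

```python
from typing import Dict, List, Optional


def render_claude_md(campaign: Dict, principles: List[Dict],
                     last_handoff: Optional[str], iteration: int) -> str:
    # Only principles with status == "active" are rendered.
    active = [p for p in principles if p.get("status") == "active"]
    lines = [
        "<!-- auto-generated; do not hand-edit -->",
        "# Research Question",
        campaign.get("research_question", ""),
        "# Active Principles",
    ]
    lines += [f"- {p['id']}: {p['statement']}" for p in active] or ["(none yet)"]
    lines += [
        f"# Most Recent Handoff (iteration {iteration})",
        last_handoff or "(no prior handoff)",
    ]
    return "\n".join(lines) + "\n"
```

The function takes plain data and returns text, so regeneration after
each merge is a read from disk plus one call, with no LLM involved.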
Behavioral tests (13 in tests/test_claude_md.py):
Generator contract:
- research question appears in output
- target system summary (name, description, metrics, knobs) appears
- Active Principles section filters out status="retired" entries
- first iteration shows "no prior handoff" placeholder
- provided handoff text and iteration label appear in section heading
- "auto-generated"/"Do not hand-edit" warning is present
Disk write contract:
- file lands at work_dir/CLAUDE.md
- successive writes overwrite atomically
Regenerate-from-disk contract:
- principles.json contents appear in the rendered file
- handoff.md contents appear in the rendered file
- iter N+1 principles section reflects updates that landed in iter N
- missing principles.json or handoff.md doesn't crash; placeholders
show through
Init wiring:
- setup_work_dir + regenerate_from_disk produces a CLAUDE.md at the
work_dir root containing campaign brief + principles.
What's NOT in this PR (deferred to a follow-up; see PR body):
* Refactoring prompts/methodology/design.md and execute_analyze.md
so the methodology is OMITTED from per-call prompts when CLAUDE.md
is auto-loaded. That's the actual token-shrink win called out in
issue acceptance criterion #2 ("Iteration N+1 prompts are measurably
smaller"). It's a non-trivial template surgery and needs careful
behavioral verification on real campaigns; landing it separately
keeps the diff reviewable.
* Auto-memory integration for cross-run learnings.
Test suite: 338 baseline + 13 new = 351 passing.
Refs #120, #131. Issue stays open pending Phase B.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: channel notification at human gates (#130, Phase A)
Phase A: outbound notification only. Configured channels (Slack
incoming-webhooks or generic JSON webhooks) receive a markdown card
when the orchestrator hits a HUMAN_DESIGN_GATE or HUMAN_FINDINGS_GATE.
The campaign still blocks on terminal input for the actual decision —
Phase B (a follow-up) wires reply parsing.
Why split: the outbound path is straightforward HTTP and stdlib-only;
reply handling needs adapter-specific logic per channel (Slack
interactive messages, Telegram bot polling, etc.) and a state machine
to wait for replies with timeout/auto-approve fallback. Shipping Phase A
unblocks the unattended-run UX (you see the gate on your phone) without
locking in design choices for the bidirectional layer.
What lands:
* orchestrator/channels.py: notify_gate(channels, summary, gate_type,
iter_dir) — POSTs a markdown card per channel. Phase A supports two
kinds:
- "slack": JSON {"text": <markdown>} to webhook_url
- "webhook": JSON {"markdown": <markdown>} to url with custom headers
Per-channel failures are isolated: a Slack webhook 5xx logs at
warning and the campaign keeps running.
* Configuration goes in campaign.yaml under top-level `channels:`,
a list of dicts each with `kind` plus channel-specific fields. The
orchestrator's gate-summary call site picks them up — no new CLI
flag needed.
* Wired into iteration._generate_gate_summary so design and findings
gates both fire the notification when channels are configured.
Test design choice: notify_gate accepts a `poster` injection seam
(matching the internal _post signature) used by tests instead of
real urllib.request.urlopen. That lets the 8 behavioral tests assert
on what's POSTed (URL, body content, headers) without touching the
network — and without coupling tests to specific stdlib internals.
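Illustration of the poster seam and per-channel isolation. A minimal
sketch, not orchestrator/channels.py: the signature is simplified (the
markdown card is passed in ready-made), and the Poster type and result
dict shape are assumptions.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical seam: (url, body_bytes, headers) -> None
Poster = Callable[[str, bytes, Dict[str, str]], None]


def notify_gate(channels: Optional[List[Dict]], markdown: str,
                poster: Poster) -> List[Dict]:
    results = []
    for ch in channels or []:
        kind = ch.get("kind")
        try:
            if kind == "slack":
                poster(ch["webhook_url"],
                       json.dumps({"text": markdown}).encode(), {})
            elif kind == "webhook":
                poster(ch["url"],
                       json.dumps({"markdown": markdown}).encode(),
                       dict(ch.get("headers", {})))
            else:
                raise ValueError(f"unknown channel kind: {kind!r}")
            results.append({"kind": kind, "ok": True})
        except Exception as exc:
            # One failing channel is recorded and must not stop the rest.
            results.append({"kind": kind, "ok": False, "error": str(exc)})
    return results
```

A recording poster that appends its arguments to a list is enough for
tests to assert on URL, body, and headers without any network access.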
Behavioral tests (8 in tests/test_channels.py):
No channels:
- None config: no-op, returns []
- empty list: no-op, returns []
Slack channel:
- posts to webhook_url with JSON {"text": markdown}
- markdown card includes gate_type, summary text, key points,
iter dir, and approve/reject/abort instructions
Generic webhook:
- posts to url with custom Authorization header
- JSON body uses {"markdown": ...} key
Error isolation:
- first channel raising OSError doesn't break the second
- unknown kind records error in results, never raises
Markdown card shape:
- iter_dir basename appears (so reviewers can find artifacts)
- summary text appears even when key_points is empty
All assertions are about what was sent over the wire (captured by the
recording poster). None inspect internal helpers or which dispatcher
function ran.
Test suite: 338 baseline + 8 new = 346 passing.
Refs #120, #130. Issue stays open pending Phase B (reply handling).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: campaign-index pure functions, foundation for nous-mcp (#126 Phase A)
The MCP server (#126) exposes campaigns as resources and tools. Phase A
ships the pure-function layer that the eventual stdio MCP transport
will wrap: list_campaigns, search_principles, get_arm_results,
compare_iterations. Each function takes a search/campaign root on disk
and returns JSON-friendly dicts/lists; no MCP runtime dependency, no
network, no global state.
Why split A/B: shipping the pure functions first means
* the CLI can use them too (a future "nous list", "nous find-principle"
has zero new code to write — just argparse plumbing),
* Routines (#134) can publish findings into the same store via the
same API,
* the MCP transport choice (stdio JSON-RPC, the mcp Python SDK
version pin, etc.) is a separate review without coupling to the
indexing logic.
Phase A surface:
list_campaigns(search_root, *, query, status, repo) -> [summary]
  Walks search_root for campaign roots (state.json + ledger.json),
  filters by run_id substring / phase / repo, returns sorted summaries.
  completed_iterations comes from ledger; active_principles filters
  by status=="active" so retired entries don't inflate the count.
search_principles(search_root, text, *, only_active) -> [hit]
  Case-insensitive substring match against statement / description /
  category / id. Default skips retired. Sorted by (run_id, principle.id).
  Embedding-based search noted in the issue is gated on
  OPENAI_API_KEY and ships as Phase B.
get_arm_results(campaign_root, iteration, arm) -> {seeds: [...]}
  Reads runs/iter-N/results/<arm>/<seed>/. Returns relative file
  paths, sorted, so MCP clients have stable references.
compare_iterations(campaign_root, iter_a, iter_b) -> {a, b, delta}
  Deterministic diff: arm_status_changes, principles_added.
  Calling twice on the same data must produce byte-equal output:
  no timestamps, no map iteration order leaks. The acceptance
  criterion for #126 explicitly calls out determinism.
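Illustration of the determinism requirement. A minimal sketch, not the
real compare_iterations: it takes already-loaded arm statuses and
principle ids instead of a campaign root, and to_stable_json is a
hypothetical helper for the byte-equality check.

```python
import json
from typing import Dict, List


def compare_iterations(arms_a: Dict[str, str], arms_b: Dict[str, str],
                       principles_a: List[str],
                       principles_b: List[str]) -> Dict:
    # Only arms whose status differs between the two iterations appear.
    changes = {
        arm: {"from": arms_a.get(arm), "to": arms_b.get(arm)}
        for arm in sorted(set(arms_a) | set(arms_b))
        if arms_a.get(arm) != arms_b.get(arm)
    }
    return {
        "arm_status_changes": changes,
        "principles_added": sorted(set(principles_b) - set(principles_a)),
    }


def to_stable_json(result: Dict) -> str:
    # sort_keys removes any dependence on dict insertion order.
    return json.dumps(result, sort_keys=True, separators=(",", ":"))
```

Sorting every set-derived list and serializing with sorted keys is what
keeps repeated calls byte-equal regardless of input ordering.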
Out of scope (Phase B):
- The stdio MCP server itself (bin/nous-mcp, ~/.claude.json snippet).
- Embedding-based semantic search behind OPENAI_API_KEY.
Behavioral tests (17 in tests/test_campaign_index.py):
list_campaigns:
- returns three synthesized campaigns with expected counts/phases
- query="saturation" filters down to that one run
- status="DONE" filters by phase
- active_principles count excludes status=="retired" entries
- results are sorted by run_id (determinism)
- empty search root returns []
- repo path resolves to <repo> when work_dir was created at
<repo>/.nous/<run-id>
search_principles:
- finds principle by substring in statement
- case-insensitive
- skips retired by default; only_active=False includes them
- sorted by (run_id, principle.id) — determinism
get_arm_results:
- aggregates multiple seeds with file listings sorted
- missing arm returns empty seeds list
compare_iterations:
- arm status change appears in delta; unchanged arms don't
- principles_added is a sorted set difference between iter updates
- byte-equal output across repeated calls
All assertions describe what the function returned given on-disk inputs.
None inspect helper invocations or internal walk order. The walk
implementation can change freely as long as the contract holds.
Test suite: 338 baseline + 17 new = 355 passing.
Refs #120, #126. Issue stays open pending Phase B (MCP transport).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: orphan-worktree GC at run start (#133, Phase A)
Add gc_orphan_worktrees() and wire it into run_campaign startup so
ghost worktrees from crashed/killed prior runs are cleaned before the
new run begins.
Why: the 5/18 mech-design-enforcement run showed ghost iter-N-XXXX
directories lingering as worktrees for hours after their owning
processes died. The harness-managed Agent(isolation="worktree") path
(the issue's main thrust) lands as part of #123 (parallel-arm
subagents); until then, this GC closes the visible loop where stale
worktrees accumulate.
GC heuristic:
* Walk <repo>/.nous-experiments/.
* For each entry older than max_age_seconds (default 1h):
  - if .nous-pid is recorded and that PID is alive, keep it.
  - otherwise, untrack via git worktree remove --force, rm -rf the
    dir, and clean up the matching nous-exp-* branch.
* Return the list of experiment_ids removed (sorted).
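Illustration of the selection half of the heuristic. A minimal sketch
with hypothetical helper names; it only decides which entries are
orphans and leaves the git worktree / branch cleanup out. Measuring age
by directory mtime is an assumption of this sketch.

```python
from pathlib import Path
from typing import Callable, List, Optional


def read_pid(entry: Path) -> Optional[int]:
    try:
        return int((entry / ".nous-pid").read_text().strip())
    except (OSError, ValueError):
        return None  # missing or garbage file counts as "no PID"


def select_orphans(root: Path, now: float,
                   pid_check: Callable[[int], bool],
                   max_age_seconds: float = 3600.0) -> List[str]:
    if not root.is_dir():
        return []
    orphans = []
    for entry in root.iterdir():
        if not entry.is_dir():
            continue
        if now - entry.stat().st_mtime <= max_age_seconds:
            continue  # too recent: keep
        pid = read_pid(entry)
        if pid is not None and pid_check(pid):
            continue  # owning process still alive: keep
        orphans.append(entry.name)
    return sorted(orphans)
```

Injecting now and pid_check is what lets the tests avoid real clocks
and real process ids.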
Phase B (deferred to #123): switch from manual create_experiment_worktree
+ remove_experiment_worktree to harness-native Agent(isolation="worktree")
on per-arm subagents. That collapses the lifecycle entirely; LoC reduction
of worktree.py (the issue's >=60% acceptance criterion) lands then.
Behavioral tests (8 in tests/test_worktree_gc.py):
- no .nous-experiments dir: returns []
- old worktree with no .nous-pid: removed
- recent worktree: kept
- old worktree with live PID (injected pid_check): kept
- old worktree with dead PID (injected pid_check): removed
- .nous-pid file with garbage contents: treated as no PID, removed
- mixed old/recent set: only old removed, sorted
- zero leftover after batch GC (the explicit issue criterion)
Tests inject fake clock (`now=`) and fake pid_check, so they're
deterministic across machines and don't depend on real PIDs/time.
Test suite: 338 baseline + 8 new = 346 passing.
Refs #120, #133. Issue stays open pending Phase B (#123 lands the
harness-isolation switch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf: cache hit-rate stats + nous cost --cache-stats (#122)
Stacks on #121 (SDK port). Adds the measurement infrastructure for
prompt caching:
* orchestrator/cache_stats.py: aggregates llm_metrics.jsonl into
a hit-rate summary. Reads the cache_creation_input_tokens and
cache_read_input_tokens fields that both CLIDispatcher (since #41)
and SDKDispatcher (#121) emit. Per-call rows are split into three
buckets — uncached / creation / read — and the overall hit rate is
read / (uncached + creation + read). By-phase breakdown surfaces
DESIGN-vs-EXECUTE_ANALYZE asymmetry.
* `nous cost --cache-stats` flag prints the hit-rate summary alongside
the existing usage breakdown. Users see the cache benefit empirically.
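Illustration of the hit-rate arithmetic. A minimal sketch, not
orchestrator/cache_stats.py: the function name and returned keys are
assumptions, and so is "input_tokens" as the field holding uncached
tokens. The two cache_* field names come from the description above.

```python
import json
from typing import Dict, Iterable


def summarize_cache(lines: Iterable[str]) -> Dict[str, float]:
    totals = {"uncached": 0, "creation": 0, "read": 0}
    calls = 0
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip corrupt JSONL lines
        if not isinstance(row, dict):
            continue
        calls += 1
        # Missing (or null) token fields are treated as zero.
        totals["uncached"] += row.get("input_tokens", 0) or 0
        totals["creation"] += row.get("cache_creation_input_tokens", 0) or 0
        totals["read"] += row.get("cache_read_input_tokens", 0) or 0
    denom = sum(totals.values())
    # hit rate = read / (uncached + creation + read), 0.0 when empty
    hit_rate = totals["read"] / denom if denom else 0.0
    return {"total_calls": calls, "hit_rate": hit_rate, **totals}
```

For example, a cold call with 100 uncached + 900 creation tokens
followed by a warm call with 100 uncached + 900 read tokens gives
900 / 2000 = 0.45.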
Why ship the measurement before the cache_control tweak: criterion #2
of #122 ("On a representative 5-iteration campaign, total input tokens
decrease by ≥ 25% vs the pre-change baseline") is something we have to
*measure*, not just assert in a unit test. Once #121 lands and the
SDKDispatcher's runner factory marks the methodology system block as
ephemeral-cached (a one-line change to the ClaudeAgentOptions
construction), the hit-rate stats here are how we verify the win on a
real campaign.
The cache_control marker itself is in scope for the runner factory in
#121's sdk_dispatch.py — it's set when the methodology prompt is passed
as the system_prompt. SDKDispatcher already accepts a system_prompt
constructor arg; wiring it to the methodology text ships in a follow-up
once we decide on a simple injection point that doesn't disturb the
prompt_loader API for non-SDK paths.
Behavioral tests (8 in tests/test_cache_stats.py):
Empty / robustness:
- missing file: zeroed summary, total_calls=0
- empty file: same
- corrupt JSONL lines are skipped, valid lines still counted
- missing token fields treated as zero (no KeyError)
Hit-rate math:
- cold call (creation only) + warm call (read only): hit_rate is
read / (uncached + creation + read)
- all-zero rows produce hit_rate=0.0 with no division-by-zero
By-phase:
- separate buckets for design vs execute-analyze with independent
hit rates
Formatting:
- format_cache_stats includes hit rate, by-phase breakdown, and
is human-readable
Tests assert on returned dict structure (the contract the CLI consumes),
not on which JSONL parser it used or how it grouped rows internally.
Test suite (this branch, stacked on #121): 344 + 8 new = 352 passing.
Refs #120, #122. Stacked on #136 (#121).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: nous status --watch / --line + snapshot reader (#127, Phase A)
Stacks on #121. Phase A ships the deterministic status surface that the
CLI hooks into:
* orchestrator/status.py: read_status_snapshot(work_dir, *, now,
stuck_threshold_seconds) builds a StatusSnapshot from state.json,
ledger.json, principles.json, and the most recent
runs/iter-N/executor_log.jsonl event. The stuck flag flips when the
last log event is older than stuck_threshold_seconds (5 minutes by
default).
* format_one_liner(snap) renders the snapshot as a single line for
shell prompts and CI logs. Stable across two consecutive calls when
no new events arrived (the property prompt-embedders rely on).
* format_watch_panel(snap) renders a multi-line panel for
nous status --watch. Plain text in Phase A — the redraw loop just
clears + reprints. Phase B can swap in rich/textual without changing
the snapshot contract.
* CLI: nous status now supports --watch (loop + redraw at --interval
seconds, default 2s), --line (single-line summary), and the existing
one-shot mode (now using format_watch_panel for consistency).
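Illustration of the stuck flag and the one-liner's stability property.
A minimal sketch, not orchestrator/status.py: the snapshot fields, the
make_snapshot helper, and the one-liner layout are assumptions, and the
300-second default mirrors the 5-minute threshold described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StatusSnapshot:
    phase: str
    iteration: int
    active_principles: int
    elapsed_since_last_event: Optional[float]  # None when no log exists
    stuck: bool


def make_snapshot(phase: str, iteration: int, active_principles: int,
                  last_event_ts: Optional[float], now: float,
                  stuck_threshold_seconds: float = 300.0) -> StatusSnapshot:
    elapsed = None if last_event_ts is None else now - last_event_ts
    stuck = elapsed is not None and elapsed > stuck_threshold_seconds
    return StatusSnapshot(phase, iteration, active_principles, elapsed, stuck)


def format_one_liner(snap: StatusSnapshot) -> str:
    # Depends only on the snapshot, so repeated calls are byte-identical.
    parts = [f"{snap.phase} iter={snap.iteration}",
             f"principles={snap.active_principles}"]
    if snap.stuck:
        parts.append("STUCK")
    return " | ".join(parts)
```

Keeping the clock out of the formatter (it is read once, when the
snapshot is built) is what makes the output safe to embed in prompts.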
What lands later in Phase B: the SDK event tee — sdk_dispatch.py
appending each --output-format stream-json row to executor_log.jsonl as
the session runs. The status reader here already consumes that file
when present, so flipping the SDK switch lights up the watch panel
without code changes.
Behavioral tests (13 in tests/test_status.py):
read_status_snapshot:
- minimal state-only campaign
- completed_iterations counted from ledger.json (≥1 only)
- active_principles excludes status="retired"
- last_event picked up from executor_log.jsonl; elapsed_since_last_event
computed from injected now=
- stuck flag flips after 5 minutes of silence
- corrupt state.json doesn't crash; defaults to "?"
- corrupt JSONL lines in executor_log are skipped, valid lines win
format_one_liner:
- single line, no newlines
- STUCK marker appears when set
- byte-stable across two calls on same snapshot (prompt-embedder
contract)
format_watch_panel:
- multi-line panel includes phase, iteration, principle count
- STUCK warning rendered distinctly
- "(no events yet)" placeholder when log absent
Tests inject now= and explicit os.utime on the log file so they're
deterministic across machines and don't depend on real wall-clock.
Test suite (this branch, stacked on #121): 344 + 13 new = 357 passing.
Refs #120, #127. Stacked on #136.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: Routines payload builder for scheduled campaigns (#134, Phase A)
Stacks on #126 (campaign_index). Phase A ships the payload builder so
users can dry-run-validate exactly what would be registered with the
Routines API. Phase B (when the API stabilizes) wires the actual POST
and Routine ID return.
Why split A/B: the Routines API is an Anthropic infrastructure feature;
its surface area and authentication story will move while it stabilizes.
Decoupling payload construction from the POST means we can ship the
shape, soak it on real campaigns, and integrate the transport later
without rewriting the payload.
Phase A surface:
build_routine_payload(campaign, *, campaign_path, schedule, pr_label,
                      mcp_refs, extra) -> dict
  Trigger: cron schedule (UTC) OR PR label, not both. Raises
  ValueError when both or neither are given.
  Campaign reference: campaign_path resolves to an absolute path the
  Routine re-reads on each fire, OR campaign_inline embeds the full
  config dict if no path is given.
  Credentials: a placeholder string (${secret:anthropic_api_key}),
  never the real key. The Routines runtime resolves from its own
  secret store.
  MCP refs (depends on #126): list of nous://... URIs the Routine
  subscribes to and writes findings into.
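Illustration of the trigger validation and campaign-reference rules. A
minimal sketch, not the shipped builder: the payload key names other
than trigger.expression and trigger.label are assumptions, and the
extra parameter and command line are omitted.

```python
import os
from typing import Dict, List, Optional


def build_routine_payload(campaign: Dict, *,
                          campaign_path: Optional[str] = None,
                          schedule: Optional[str] = None,
                          pr_label: Optional[str] = None,
                          mcp_refs: Optional[List[str]] = None) -> Dict:
    # Exactly one trigger: reject both-given and neither-given.
    if (schedule is None) == (pr_label is None):
        raise ValueError("exactly one of schedule or pr_label is required")
    if schedule is not None:
        trigger = {"kind": "cron", "expression": schedule}
    else:
        trigger = {"kind": "pr_label", "label": pr_label}
    payload = {
        "name": campaign.get("name") or campaign.get("run_id"),
        "trigger": trigger,
        # Placeholder only; the runtime resolves it from its secret store.
        "credentials": {"anthropic_api_key": "${secret:anthropic_api_key}"},
        "mcp_refs": list(mcp_refs or []),
    }
    if campaign_path is not None:
        payload["campaign_path"] = os.path.abspath(campaign_path)
    else:
        payload["campaign_inline"] = dict(campaign)
    return payload
```

Since the builder returns a plain dict, a dry run is just a call plus a
JSON dump; no transport is involved.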
Behavioral tests (10 in tests/test_routines.py):
Schedule payload:
- cron string lands in trigger.expression
- name falls back to run_id
- command line includes --auto-approve and --agent sdk
- credentials are placeholders, not real secrets
- MCP refs pass through
PR-label payload:
- pr_label lands in trigger.label
Validation:
- missing trigger raises ValueError
- both triggers raises ValueError
Campaign reference:
- campaign_path produces path reference, omits inline
- no path inlines the full campaign dict
Out of scope (Phase B):
- HTTP POST to the actual Routines API
- Returning the Routine ID after registration
- nous routine create CLI subcommand (currently a builder only)
Test suite (this branch, stacked on #126): 355 + 10 new = 365 passing.
Refs #120, #134. Stacked on #142 (#126).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: package nous as a Claude Code plugin (#125)
Ship plugin/nous/ with plugin.json + 6 skill markdown files. Each skill
is a CLI wrapper — minimal frontmatter, clear "when to use" hints, and
a Run section that shells out to the existing nous CLI or imports the
campaign_index module from #126.
What lands:
* plugin/nous/plugin.json — manifest (name, version, description,
license, skills list).
* plugin/nous/skills/nous-run.md — wraps `nous run`. Notes
--auto-approve + Slack channels for unattended runs.
* plugin/nous/skills/nous-status.md — wraps `nous status` with
--watch / --line / --interval (#127). Free to call repeatedly.
* plugin/nous/skills/nous-resume.md — wraps `nous resume` from
state.json checkpoint (#91).
* plugin/nous/skills/nous-list.md — uses campaign_index.list_campaigns
(#126) with optional query / status / repo filters.
* plugin/nous/skills/nous-bisect.md — uses
campaign_index.compare_iterations (#126). Output is byte-deterministic.
* plugin/nous/skills/nous-find-principle.md — uses
campaign_index.search_principles. Notes embedding-search as #126
Phase B.
Behavioral tests (7 in tests/test_plugin_package.py):
Manifest:
- plugin.json exists with required fields (name, version, description,
skills list)
- at least 5 skills listed (acceptance criterion)
- every listed skill file actually exists on disk
Frontmatter:
- every skill has name + description in YAML frontmatter
- descriptions include "use when" / "when the user" cues so Claude Code
can match user intent — vague descriptions are dead skills
- every skill body references either a nous command or campaign_index
Coverage:
- all six documented skills present (nous-run, nous-status, nous-resume,
nous-list, nous-bisect, nous-find-principle)
Out of scope (Phase B):
- claude plugin install integration testing (requires a live Claude Code
install with plugin support)
- publishing to a plugin registry
- skill argument templating (currently shell substitution; could move
to typed inputs once plugin contract stabilizes)
Test suite: 338 baseline + 7 new = 345 passing.
Refs #120, #125. Depends on #126 + #127 (already in flight).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: /goal-driven prompt builders for goal-bounded campaign mode (#124, Phase A)
Phase A ships the deterministic prompt + goal-directive builders for
both modes the issue calls out:
Mode A — fully /goal-driven: spawn one claude session for the whole
campaign with /goal "<predicate>". The Haiku post-turn evaluator
decides when the goal is met. No Python state machine in the inner
loop.
Mode B — /goal-bounded inner loop: keep engine.py for control flow,
but use /goal *within* EXECUTE_ANALYZE so the executor terminates
as soon as validation passes.
Phase A is the prompt assembly. Wire-up into the dispatcher and the
run_campaign code path lands in Phase B once the team picks the default.
Why the prompt builders matter: criterion #2 of the issue ("hybrid mode
is the default for nous run after one release of soak time") implies
the team will run both modes side by side on real campaigns and compare.
Behavioral testing of the prompt assembly — does it include the
campaign brief, does it spell out the goal predicate exactly — is what
makes those soak runs comparable. The /goal directive itself is just
a string, but it has to be the *right* string or the Haiku evaluator
can't decide.
Phase A surface:
build_full_goal_directive(campaign, *, iteration, timeout_hours):
Returns the predicate text for Mode A. Asserts on:
- findings.json exists with non-empty arms list
- principle_updates.json exists and parses as a list
- OR timeout exceeded (default 24 hours).
build_inner_loop_goal_directive(iteration, *, extra_predicates):
Mode B predicate. Asserts on schema validation + principle_updates
presence. Pairs with the deterministic Stop hook (#129) — the hook
catches the schema check, the /goal evaluator catches edge cases the
schema doesn't cover.
build_goal_driven_session_prompt(campaign, *, iteration, timeout_hours):
Full Mode A prompt body. Includes campaign brief, required artifact
paths, EXPLICIT instruction to print artifact paths to stdout (the
Haiku evaluator only sees what's been surfaced in the conversation),
nous validate invocation, and the /goal directive.
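Illustration of the Mode B directive assembly. A minimal sketch: the
predicate wording and the runs/iter-N path prefix are assumptions, and
only the AND-chaining of extra predicates and the /goal "<predicate>"
form come from the description above.

```python
from typing import Iterable


def build_inner_loop_goal_directive(iteration: int, *,
                                    extra_predicates: Iterable[str] = ()) -> str:
    base = f"runs/iter-{iteration}"
    predicates = [
        f"{base}/findings.json validates against findings.schema.json",
        f"{base}/principle_updates.json exists",
        *extra_predicates,
    ]
    # Extra predicates are AND-chained onto the base predicates.
    return '/goal "' + " AND ".join(predicates) + '"'
```

Building the directive from a fixed list keeps the string identical
across runs, which is what makes side-by-side soak runs comparable.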
Behavioral tests (10 in tests/test_goal_driven.py):
Full directive (Mode A):
- predicate names iter-N/findings.json + principle_updates.json
- timeout clause appears with the configured hours
- uses AND/OR logic correctly
Inner-loop directive (Mode B):
- uses schema-validation language (findings.schema.json)
- extra predicates AND-chained
Session prompt (Mode A):
- campaign brief (research question, target name, metrics, knobs) appears
- iteration number appears consistently across artifact paths
- EXPLICIT "print to stdout" instruction (the evaluator can't see
silent file writes)
- nous validate execution invocation present
- /goal directive appears in the prompt
Out of scope (Phase B):
- --goal-driven flag on nous run / nous resume
- Dispatcher integration (SDKDispatcher launching the goal-driven session)
- run_campaign code path that bypasses engine.py for Mode A
- Claude Code v2.1.139+ version detection at startup
Test suite: 338 baseline + 10 new = 348 passing.
Refs #120, #124. Issue stays open pending Phase B (dispatcher wire-up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: explore-then-synthesize DESIGN orchestration helpers (#132, Phase A)
Stacks on #121. Phase A ships the orchestration layer that makes
splitting DESIGN into Stage A (parallel Explore subagents) + Stage B
(Opus synthesis) possible without changing what gets produced
(problem.md + bundle.yaml).
DESIGN today asks one Opus session to do both codebase mapping AND
bundle synthesis. That's the canonical Claude-Code-pattern miss: broad
exploration + small synthesis is exactly what parallel Explore subagents
are for. Phase A is the orchestration helpers; Phase B (lands when #121
merges and the team picks injection points) wires the SDKDispatcher
to actually spawn Explore subagents and thread reports through to the
synthesis call.
Phase A surface:
* DEFAULT_EXPLORE_SCOPES — four scopes the issue calls out: metrics,
knobs, prior_findings, principles. Each gets its own Explore subagent.
* build_explore_prompt(scope, campaign) — produces a tight,
scope-focused prompt for a read-only Explore subagent. Multi-aspect
integration is NOT this prompt's job (Stage B does that).
* run_explore_stage(campaign, *, scopes, runner) — fans out one
subagent per scope via an injected runner callable, collects
ExploreReports. Synchronous in Phase A; the SDK's async fan-out
lands in Phase B.
* build_synthesis_prompt(stage_a, *, campaign, iteration, iter_dir)
— Opus prompt that consumes only the Explore reports + principles.json,
produces problem.md + bundle.yaml, EXPLICITLY forbids re-reading
the codebase ("Do not re-read"). That's the whole point of the
split: Opus on integration, not on file walks.
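A rough sketch of the synchronous Stage A fan-out described above, assuming a runner callable of shape `runner(scope, campaign) -> ExploreReport`. The dataclass fields and the `StageAResult` container name are hypothetical; only `DEFAULT_EXPLORE_SCOPES`, `run_explore_stage`, `ExploreReport` and `by_scope()` are named in this commit.

```python
from dataclasses import dataclass, field

DEFAULT_EXPLORE_SCOPES = ("metrics", "knobs", "prior_findings", "principles")


@dataclass
class ExploreReport:
    scope: str
    text: str
    input_tokens: int = 0  # assumed field name


@dataclass
class StageAResult:  # hypothetical container name
    reports: list = field(default_factory=list)

    @property
    def total_input_tokens(self) -> int:
        return sum(r.input_tokens for r in self.reports)

    def by_scope(self, scope: str) -> ExploreReport:
        for report in self.reports:
            if report.scope == scope:
                return report
        raise KeyError(scope)


def run_explore_stage(campaign, *, scopes=DEFAULT_EXPLORE_SCOPES, runner):
    """Call the injected runner once per scope, in order (no parallelism here)."""
    return StageAResult(reports=[runner(scope, campaign) for scope in scopes])


def fake_runner(scope, campaign):
    # Test double: no LLM involved, fixed token count per report.
    return ExploreReport(scope=scope, text=f"report on {scope}", input_tokens=100)


stage_a = run_explore_stage({"research_question": "demo"}, runner=fake_runner)
```

Because the runner is injected, swapping the synchronous loop for an async fan-out later should not require touching the report aggregation.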
Behavioral tests (13 in tests/test_explore_design.py):
build_explore_prompt:
- metrics scope focuses on observable metrics
- knobs scope focuses on configuration parameters
- prior_findings references findings.json
- principles references the principle store
- EVERY scope marks the explorer read-only (the prompt is
defense-in-depth on top of subagent_type="Explore")
run_explore_stage:
- one subagent per default scope (4 calls)
- custom scopes pass through
- token counts aggregate across reports
- by_scope() lookup returns the right report
build_synthesis_prompt:
- every explorer report appears under its `### <scope>` heading
- explicit "Do not re-read" instruction
- problem.md + bundle.yaml + iter-N + bundle.schema.yaml all named
- research question appears
Out of scope (Phase B):
- SDKDispatcher integration (spawning subagent_type="Explore" via SDK)
- anyio.gather over the four explorer calls for actual parallelism
- Token-budget measurement on a representative campaign (criterion
"DESIGN cost drops by ≥30%")
- Wall-clock measurement on multi-aspect explorations
Test suite (this branch, stacked on #121): 344 + 13 new = 357 passing.
Refs #120, #132. Stacked on #136.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf: load methodology preamble as cached system_prompt (#122 Phase B)
Closes the wiring gap from #144 (Phase A): SDKDispatcher now loads
prompts/methodology/{design,execute_analyze}.md, strips placeholders
({{target_system}}, etc.), concatenates them into a single block, and
passes that as system_prompt on every runner call. Anthropic's API
marks system blocks above the cache threshold as cached, so the second
phase call within a 5-minute window reuses the rendered preamble
instead of re-paying for it.
The dynamic context (research_question, observable_metrics, principles,
handoff) stays in the user message — that's what BUSTS the cache when
it should bust (per-iteration changes), and that's what HITS the cache
when content is stable (within-iteration designer→executor handoff).
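As an illustration of the "strip placeholders, concatenate, reuse" step, here is a small sketch. The helper name `render_preamble` and the placeholder regex are assumptions; the property it demonstrates is the one the cache depends on, namely that two consecutive renders produce byte-identical text.

```python
import re
import tempfile
from pathlib import Path

# Matches {{name}} style placeholders (assumed syntax, based on {{target_system}}).
_PLACEHOLDER = re.compile(r"\{\{\s*\w+\s*\}\}")


def render_preamble(methodology_dir, names=("design", "execute_analyze")) -> str:
    """Load each methodology file, drop placeholders, join into one block."""
    blocks = []
    for name in names:
        text = (Path(methodology_dir) / f"{name}.md").read_text(encoding="utf-8")
        blocks.append(_PLACEHOLDER.sub("", text).strip())
    return "\n\n".join(blocks)


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "design.md").write_text("Design for {{target_system}}.\n", encoding="utf-8")
    Path(tmp, "execute_analyze.md").write_text("Execute on {{ target_system }}.\n", encoding="utf-8")
    first = render_preamble(tmp)
    second = render_preamble(tmp)
```

Any per-iteration value that leaked into this block would change the prefix and defeat reuse, which is why the dynamic context stays in the user message.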
Two new behavioral tests:
* runner receives preamble: assert system_prompt contains both
methodology blocks with placeholders stripped.
* two consecutive calls reuse the same system_prompt: this is the
property the cache relies on (otherwise cache_read_input_tokens
stays at zero).
Test suite: 346 (Phase A baseline) + 2 new = 348.
Closes #122.
* feat: tee SDK events to executor_log.jsonl (#127 Phase B)
Closes the wiring gap from #145: SDKDispatcher.dispatch now derives the
per-iteration executor_log.jsonl path and threads it through to the
runner factory. The runner appends one JSONL row per SDK message so
`nous status --watch` (the snapshot reader from Phase A) lights up
without any further changes.
Implementation:
* SDKRunner Protocol gains optional event_log_path arg; the default
runner factory tees every message via _tee_event before processing.
* _tee_event records {type, ts, tool_name?, tool_use_id?, content?},
serializability-probing each surfaced field so SDK message-class
evolution doesn't break the writer. Failures are best-effort.
* SDKDispatcher.dispatch override computes work_dir/runs/iter-N/
executor_log.jsonl and resets after dispatch so a later call from a
different iteration doesn't reuse the wrong path.
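A minimal sketch of a best-effort JSONL tee along the lines described above. The function name `tee_event` (without the leading underscore), the use of the message class name as `type`, and the `repr()` fallback for non-serializable fields are assumptions, not the shipped behavior.

```python
import json
import tempfile
import time
from pathlib import Path

_SURFACED_FIELDS = ("tool_name", "tool_use_id", "content")


def tee_event(message, event_log_path) -> None:
    """Append one JSONL row for `message`; swallow any failure (best-effort)."""
    try:
        row = {"type": type(message).__name__, "ts": time.time()}
        for name in _SURFACED_FIELDS:
            value = getattr(message, name, None)
            if value is None:
                continue
            try:
                json.dumps(value)  # probe: is this field serializable as-is?
            except (TypeError, ValueError):
                value = repr(value)  # assumed fallback for unknown message shapes
            row[name] = value
        path = Path(event_log_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(row) + "\n")
    except Exception:
        pass  # logging must never break the dispatch itself


class ToolUse:  # stand-in for an SDK message class
    tool_name = "Bash"
    tool_use_id = "tu_1"
    content = object()  # deliberately not JSON-serializable


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "runs" / "iter-1" / "executor_log.jsonl"
    tee_event(ToolUse(), log)
    tee_event(ToolUse(), log)
    rows = [json.loads(line) for line in log.read_text(encoding="utf-8").splitlines()]
```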
Two new behavioral tests (in test_status.py since the contract this
verifies is the snapshot reader's input):
* runner receives the iteration-specific event_log_path.
* each iteration gets its own event log (no cross-iter leakage).
The Phase A status reader from #145 already consumes this file when
present, so warm-watch sessions now reflect tool-call events within
the redraw interval (~2s).
Closes #127.
* refactor: thin prompt templates when CLAUDE.md is in scope (#131 Phase B)
Closes the token-shrink wiring from #140 (Phase A): PromptLoader now
prefers <template>_thin.md when a CLAUDE.md is detected at work_dir.
The thin variants drop methodology (~400 lines) and reference CLAUDE.md
for it instead, since Claude Code auto-loads CLAUDE.md from work_dir
on every session.
Concretely:
* orchestrator/prompt_loader.py: PromptLoader gains
claude_md_at param. When set and the path exists, _resolve_template_path
picks <template>_thin.md if present, else falls back to full template.
* orchestrator/llm_dispatch.py: LLMDispatcher constructs PromptLoader
with claude_md_at=work_dir/CLAUDE.md. The CLAUDE.md generator from
Phase A (orchestrator/claude_md.py) writes that file at init and
after every iteration, so the thin path is active for any campaign
using the SDK / API path.
* prompts/methodology/design_thin.md: 27 lines of per-iter context
(vs 266 in design.md). Refers the agent to CLAUDE.md for methodology.
* prompts/methodology/execute_analyze_thin.md: 22 lines (vs 199 in
execute_analyze.md).
* Other templates (report.md, summarize_gate.md) are short enough not
to need thin variants; loader falls back to full when no _thin
exists.
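The selection rule above can be sketched as a small pure function. This is an illustration under assumptions: the real `_resolve_template_path` is a `PromptLoader` method, and the standalone signature below is hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_template_path(template_dir, name, claude_md_at=None) -> Path:
    """Prefer <name>_thin.md when CLAUDE.md exists; otherwise use <name>.md."""
    full = Path(template_dir) / f"{name}.md"
    if claude_md_at is not None and Path(claude_md_at).exists():
        thin = Path(template_dir) / f"{name}_thin.md"
        if thin.exists():
            return thin
    return full


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for filename in ("design.md", "design_thin.md", "report.md", "CLAUDE.md"):
        (root / filename).write_text("x", encoding="utf-8")
    claude_md = root / "CLAUDE.md"
    with_claude_md = resolve_template_path(root, "design", claude_md).name
    no_thin_variant = resolve_template_path(root, "report", claude_md).name
    without_claude_md = resolve_template_path(root, "design").name
    missing_claude_md = resolve_template_path(root, "design", root / "absent.md").name
```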
Behavioral tests (6 new):
TestThinTemplateSelection (4):
- full template used when no CLAUDE.md
- thin template picked when CLAUDE.md exists
- full used when template has no _thin variant
- thin is < 50% size of full (the issue's empirical criterion)
TestRealMethodologyThinTemplates (2):
- shipped design_thin.md renders against the dispatcher's real
context shape AND is < 50% size of full design.md
- shipped execute_analyze_thin.md renders against real context shape
Test suite: 351 baseline + 6 new = 357 passing.
Closes #131.
* chore: codify no-live-LLM-in-tests as a hard project principle
User directive on 2026-05-24: 'Tests must mock LLMs and not spend
token budget. Keep this as a development principle. Always.' And:
'Save it on claude.md everywhere. Not just memory. Save it in multiple
places if you need to.'
Lands the principle in five durable places + active enforcement:
1. CLAUDE.md (repo root, NEW): non-negotiable rule at the top, with
concrete how-to-mock guidance per dispatcher (LLM/CLI/SDK/Inline/
Stub). Auto-loaded by Claude Code on every session.
2. tests/CLAUDE.md (NEW): restates the rule + injection seams so the
principle stays in scope when Claude Code is operating inside tests/.
3. tests/conftest.py — block_live_llm_calls autouse fixture:
- strips OPENAI_API_KEY / OPENAI_BASE_URL / ANTHROPIC_API_KEY from env
- patches urllib.request.urlopen to raise LiveLLMCallBlocked when
the URL contains api.anthropic.com / api.openai.com / api.litellm.ai
- patches claude_agent_sdk.query (when installed) to hard-fail
If a test trips the guard, the fix is to inject a fake at the
dispatcher seam — never to disable the guard.
4. tests/test_no_live_llm_guard.py (NEW): meta-tests verifying the
guard fires correctly. If the guard breaks, CI fails loudly:
- env keys are stripped
- urlopen to anthropic.com / openai.com raises LiveLLMCallBlocked
- non-LLM hosts pass through (Slack webhooks, etc., still work
via their own injection)
- claude_agent_sdk.query is blocked when installed (skipped here
since the SDK isn't a test dep yet)
5. docs/contributing/workflow.md — Non-negotiable rules section at
the top stating the no-live-LLM rule, the behavioral testing
rule, and the token-budget invariant.
Audit of existing tests: all already mock correctly:
* test_llm_dispatch.py uses _make_fake_completion + completion_fn=
* test_cli_dispatch.py patches subprocess.run
* test_integration_llm.py uses _make_routing_completion
* test_sdk_dispatch.py uses _ScriptedRunner sdk_runner injection
* StubDispatcher path needs no LLM at all
So this PR is enforcement + documentation, not a refactor of existing
tests.
Test suite: 338 baseline + 5 new + 1 SDK-skip = 343 passing, 1 skipped.
Refs the user's 2026-05-24 directive. No issue closed by this PR —
it's a project-wide invariant, equally applicable to all #120 work
and any future contribution.
* feat: run_goal_driven_iteration runner (#124 Phase B)
Closes the dispatcher wire-up from #148 (Phase A): adds
run_goal_driven_iteration(dispatcher, campaign, iteration, work_dir)
which builds the goal-driven prompt, dispatches it through the
provided dispatcher (SDKDispatcher canonical), and persists the
conversation transcript as runs/iter-N/design_log.md.
The agent itself produces problem.md, bundle.yaml, findings.json,
etc. via tool calls inside the session; the orchestrator only saves
the transcript. This is the Mode A from #124's issue body —
'fully /goal-driven (lightweight)' — bypassing engine.py.
Two new behavioral tests:
- dispatches goal-driven prompt (asserts /goal appears, asserts
iter-N path appears) and writes log to expected location
- creates iter dir if missing
The CLI flag --goal-driven and run_campaign integration would call
this function instead of the per-phase dispatch loop. That last bit of
plumbing (engine.py bypass, --goal-driven flag) is left for the
soak-and-decide cycle the issue calls out — once a campaign runs in
goal-driven mode and proves equivalent quality on a real target.
Closes #124.
* feat: submit_routine HTTP POST with poster injection (#134 Phase B)
Closes the API-submission gap from #146 (Phase A): adds
submit_routine(payload, *, api_base, api_key, poster, timeout) which
POSTs the payload to the Routines API and returns the response dict
(typically containing routine_id).
Per the no-live-LLM project principle (CLAUDE.md), the function takes
a poster injection seam — tests pass a recording fake; production
uses urllib.request.urlopen. Defaults to api.anthropic.com/v1/routines;
override via ROUTINES_API_BASE env var or api_base= kwarg.
Auth: Bearer ANTHROPIC_API_KEY (env or kwarg). When no key AND no
poster, the function raises RuntimeError loudly — silent fall-back to
anonymous would be a real-world misconfig.
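A sketch of the poster injection seam, under stated assumptions: the poster call shape `poster(url, body, headers, timeout)`, the `https://` scheme on the default base, and the 30-second default timeout are all hypothetical. The Routines endpoint itself is taken from this commit message only.

```python
import json
import os
import urllib.request

DEFAULT_API_BASE = "https://api.anthropic.com/v1/routines"  # scheme assumed


def _urlopen_poster(url, body, headers, timeout):
    """Production poster: a plain urllib POST returning the decoded JSON body."""
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def submit_routine(payload, *, api_base=None, api_key=None, poster=None, timeout=30.0):
    key = api_key or os.environ.get("ANTHROPIC_API_KEY")
    if not key and poster is None:
        # Fail loudly rather than silently posting anonymously.
        raise RuntimeError("submit_routine: no API key and no poster injected")
    url = api_base or os.environ.get("ROUTINES_API_BASE") or DEFAULT_API_BASE
    headers = {"Content-Type": "application/json"}
    if key:
        headers["Authorization"] = f"Bearer {key}"
    body = json.dumps(payload).encode("utf-8")
    return (poster or _urlopen_poster)(url, body, headers, timeout)


class RecordingPoster:  # test double: records calls, never touches the network
    def __init__(self):
        self.calls = []

    def __call__(self, url, body, headers, timeout):
        self.calls.append((url, body, headers, timeout))
        return {"routine_id": "r-123", "status": "queued"}


poster = RecordingPoster()
response = submit_routine(
    {"name": "nightly"},
    api_base="https://routines.example.test",
    api_key="test-key",
    poster=poster,
)
```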
Four new behavioral tests:
- posts payload with Bearer auth header and JSON content type
- custom api_base is honored
- response dict (routine_id, status) returned to caller
- missing api_key + no poster raises RuntimeError
All four use the _RecordingPoster fake — no network. The conftest
guard from #151 would block live HTTP to api.anthropic.com regardless.
Closes #134.
* feat: nous-mcp stdio server (#126 Phase B)
Closes the transport gap from #142 (Phase A): bin/nous-mcp is a
stdio JSON-RPC 2.0 server that wraps the campaign_index pure
functions as MCP resources + tools.
Resources (resources/list + resources/read):
- nous://campaigns (index of all)
- nous://campaigns/<run_id>/state (state.json contents)
- nous://campaigns/<run_id>/principles (principles.json contents)
- nous://campaigns/<run_id>/iter/<N>/findings (findings.json contents)
Tools (tools/list + tools/call):
- nous.list_campaigns(search_root, query?, status?, repo?)
- nous.search_principles(search_root, text, only_active?)
- nous.get_arm_results(campaign_root, iteration, arm)
- nous.compare_iterations(campaign_root, iter_a, iter_b)
The server is intentionally dependency-free: pure stdlib (json + sys),
with no mcp-python-sdk pin. It is compatible with Claude Code's MCP
transport via ~/.claude.json:
{
"mcpServers": {
"nous": {
"command": "python",
"args": ["-u", "/path/to/repo/bin/nous-mcp"],
"env": {"NOUS_SEARCH_ROOT": "/path/to/parent/of/.nous/"}
}
}
}
handle_request(request, *, search_root) is exposed as a pure function
so tests can drive the server with JSON-RPC payloads without spinning
up real stdio. 11 behavioral tests cover initialize, resources/list,
resources/read for state and principles, unknown campaign -> JSON-RPC
error, tools/list returns 4 tools, list_campaigns / search_principles
calls, unknown tool -> error, missing required args -> error not crash.
The conftest guard from #151 ensures none of these tests touch a real
network — they read on-disk fixtures only.
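To show why a pure request handler is easy to test, here is a reduced sketch covering only tools/list and tools/call. The name `handle_request_sketch` and the injected `tools` mapping are assumptions (the shipped `handle_request` takes `search_root` instead); the error codes are the standard JSON-RPC 2.0 ones.

```python
import json

METHOD_NOT_FOUND = -32601  # JSON-RPC 2.0 reserved codes
INVALID_PARAMS = -32602


def handle_request_sketch(request, *, tools):
    """Map one JSON-RPC request dict to one response dict; never raise."""
    req_id = request.get("id")
    params = request.get("params") or {}

    def ok(result):
        return {"jsonrpc": "2.0", "id": req_id, "result": result}

    def error(code, message):
        return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}

    method = request.get("method")
    if method == "tools/list":
        return ok({"tools": [{"name": name} for name in sorted(tools)]})
    if method == "tools/call":
        name = params.get("name")
        if name not in tools:
            return error(INVALID_PARAMS, f"unknown tool: {name}")
        try:
            result = tools[name](**(params.get("arguments") or {}))
        except TypeError as exc:  # missing or unexpected arguments
            return error(INVALID_PARAMS, f"invalid arguments: {exc}")
        return ok({"content": [{"type": "text", "text": json.dumps(result)}]})
    return error(METHOD_NOT_FOUND, f"method not found: {method}")


tools = {"nous.list_campaigns": lambda search_root: ["run-1", "run-2"]}
listed = handle_request_sketch({"jsonrpc": "2.0", "id": 1, "method": "tools/list"}, tools=tools)
missing_args = handle_request_sketch(
    {"jsonrpc": "2.0", "id": 2, "method": "tools/call",
     "params": {"name": "nous.list_campaigns", "arguments": {}}},
    tools=tools,
)
unknown_tool = handle_request_sketch(
    {"jsonrpc": "2.0", "id": 3, "method": "tools/call", "params": {"name": "nous.nope"}},
    tools=tools,
)
```

One caveat of this simplification: catching `TypeError` around the call also masks type errors raised inside a tool, so a real server may prefer to validate arguments against a schema first.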
Closes #126.
* feat: parse_reply + wait_for_reply for channel gate decisions (#130 Phase B)
Closes the reply-handling gap from #141 (Phase A): adds two new
functions to orchestrator.channels.
parse_reply(text) -> 'approve' | 'reject' | 'abort' | None
Maps a free-form channel message to a gate Decision. Recognized
tokens (case-insensitive, first-word match):
approve | approved | lgtm | ok | yes -> approve
reject | rejected | no | redesign -> reject
abort | stop | cancel -> abort
Returns None when the reply doesn't decode to a decision so callers
can keep waiting.
wait_for_reply(reply_provider, *, timeout_seconds, ...) -> str | None
Polls reply_provider until it returns a recognized decision or
timeout elapses. On timeout returns None — the issue's documented
fall-back to --auto-approve semantics.
Both functions take dependency-injection seams (sleeper, clock,
reply_provider) for deterministic testing — no real wall-clock, no
real channel polling. The actual per-channel adapters (Slack
interactive messages, Telegram bot polling, etc.) plug into
reply_provider via small adapter functions; this PR ships the core
state machine.
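A compact sketch of the two functions, using the token table above. The `poll_seconds` parameter, the trailing-punctuation stripping on the first word, and the `FakeClock` helper are assumptions added for illustration.

```python
import time

_TOKENS = {
    "approve": ("approve", "approved", "lgtm", "ok", "yes"),
    "reject": ("reject", "rejected", "no", "redesign"),
    "abort": ("abort", "stop", "cancel"),
}
_LOOKUP = {token: decision for decision, tokens in _TOKENS.items() for token in tokens}


def parse_reply(text):
    """Return 'approve' | 'reject' | 'abort', or None if the first word is unknown."""
    if not text:
        return None
    words = text.split()
    if not words:
        return None
    first = words[0].lower().strip(".,!:;")  # assumed: tolerate "LGTM," and "ok!"
    return _LOOKUP.get(first)


def wait_for_reply(reply_provider, *, timeout_seconds, poll_seconds=5.0,
                   sleeper=time.sleep, clock=time.monotonic):
    """Poll until a recognized decision arrives or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        decision = parse_reply(reply_provider())
        if decision is not None:
            return decision
        sleeper(poll_seconds)
    return None  # caller falls back to its timeout policy


class FakeClock:  # deterministic clock + sleeper for tests
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


replies = iter(["hmm, let me look", None, "LGTM, ship it"])
clock = FakeClock()
decision = wait_for_reply(lambda: next(replies), timeout_seconds=60,
                          sleeper=clock.sleep, clock=clock)
timed_out = wait_for_reply(lambda: "still thinking", timeout_seconds=12,
                           sleeper=clock.sleep, clock=clock)
```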
Seven new behavioral tests:
- parse_reply recognizes each token family (approve/reject/abort)
- parse_reply returns None on unrecognized replies, empty string,
and None input
- wait_for_reply returns the decision on first recognized reply
- wait_for_reply returns None on timeout
- wait_for_reply keeps polling past unrecognized replies
All assertions describe the function's return value given inputs.
None inspect internal control flow or which sleeper/clock methods
were called.
Closes #130.
* feat: make_isolated_arm_runner factory for harness-managed worktrees (#133 Phase B)
Closes the harness-isolation gap from #143 (Phase A): adds
make_isolated_arm_runner(*, sdk_runner, repo_path, iter_dir, ...)
that returns an ArmRunner-shaped callable backed by a worktree-isolated
SDK subagent.
Per the no-live-LLM project principle, the factory takes an injected
sdk_runner — the real ClaudeAgentOptions(isolation='worktree')
construction lives behind that seam. Tests pass a recording fake and
assert the factory's contract (signature, returned-callable shape,
ArmUnit -> ArmUnitResult mapping); the harness call itself is verified
on soak.
The runner:
* creates iter_dir/results/<arm>/<seed>/ before dispatch
* passes a clear arm/command/seed prompt with explicit results-dir +
patch-capture instructions
* dispatches via sdk_runner with isolation='worktree' and
subagent_type kwargs (with TypeError fallback to the basic-runner
signature for forward/backward compatibility)
* on is_error result, returns ArmUnitResult(status='failed') with
the error message
* on success, scans results_dir and returns ArmUnitResult with the
sorted relative-file listing
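The runner behavior listed above might look roughly like the following. The dataclass fields, the prompt wording, the `"general-purpose"` default for `subagent_type`, and the reduced factory signature (no `repo_path`, `model`, or `max_turns`) are assumptions made to keep the sketch short.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from types import SimpleNamespace


@dataclass(frozen=True)
class ArmUnit:  # hypothetical shape
    arm_id: str
    command: str
    seed: str = "seed-1"


@dataclass
class ArmUnitResult:  # hypothetical shape
    unit: ArmUnit
    status: str
    files: list = field(default_factory=list)
    error: str = ""


def make_isolated_arm_runner(*, sdk_runner, iter_dir, subagent_type="general-purpose"):
    def run(unit: ArmUnit) -> ArmUnitResult:
        results_dir = Path(iter_dir) / "results" / unit.arm_id / unit.seed
        results_dir.mkdir(parents=True, exist_ok=True)
        prompt = (f"Run arm {unit.arm_id} (seed {unit.seed}): {unit.command}\n"
                  f"Write output files to {results_dir}")
        try:
            result = sdk_runner(prompt, isolation="worktree", subagent_type=subagent_type)
        except TypeError:
            # Runner does not accept the extra kwargs: retry with the basic signature.
            result = sdk_runner(prompt)
        if getattr(result, "is_error", False):
            return ArmUnitResult(unit, "failed", error=str(getattr(result, "text", "")))
        files = sorted(str(p.relative_to(results_dir))
                       for p in results_dir.rglob("*") if p.is_file())
        return ArmUnitResult(unit, "complete", files=files)
    return run


def legacy_runner(prompt):  # fake without isolation kwargs
    return SimpleNamespace(is_error=False, text="done")


def failing_runner(prompt, *, isolation, subagent_type):  # fake that reports an error
    return SimpleNamespace(is_error=True, text="simulation crashed")


with tempfile.TemporaryDirectory() as tmp:
    unit = ArmUnit("h-main", "python sim.py")
    ok = make_isolated_arm_runner(sdk_runner=legacy_runner, iter_dir=tmp)(unit)
    bad = make_isolated_arm_runner(sdk_runner=failing_runner, iter_dir=tmp)(unit)
    results_dir_created = (Path(tmp) / "results" / "h-main" / "seed-1").is_dir()
```

Note that a `TypeError` raised inside a runner that does accept the kwargs would also trigger the retry in this sketch; inspecting the runner's signature up front is one alternative.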
This is the bridge between #143 (worktree GC) and #150 (parallel-arm
orchestration); once #123 wires this runner into the parallel-arm path,
the manual create_experiment_worktree / remove_experiment_worktree
lifecycle becomes vestigial — a follow-up cleanup PR drops it
(closing the issue's ≥60% LoC reduction acceptance criterion).
Two new behavioral tests:
- test_returns_callable: factory returns a callable matching ArmRunner
(skipped when parallel_arms is on a not-yet-merged branch).
- test_factory_accepts_documented_kwargs: signature contract with
model, max_turns, subagent_type kwargs. Construction must not
raise.
Closes #133.
* feat: parallel-arm orchestration helpers (#123, Phase A)
Stacks on #133 (which stacks on #121). Phase A ships the orchestration
layer that turns experiment_plan.yaml into a flat list of independent
units, fans them out via an injected runner, and deterministically
merges their results into a findings-shaped dict. The actual SDK
subagent fan-out + worktree-isolation per unit (the issue's main thrust)
is Phase B once #121 + #133 merge.
Why partition first: the 5/18 mech-design-enforcement session ran 8
conditions × 3 seeds = 24 simulations sequentially in one Sonnet
session. That 2.5-hour mega-session is what produced the connection
drops and the race-two-executors bug. Decomposing into small
independent units is the prerequisite to parallel execution; once the
units exist as data, the run path can be sync (Phase A) or
anyio.gather over SDK subagents (Phase B) without touching the
partitioner or merge.
Phase A surface:
partition_plan(plan) -> list[ArmUnit]
Turns experiment_plan.yaml into one ArmUnit per (arm × condition × seed).
Default seed when none specified is "seed-1"; multi-seed conditions
fan out. Skips arms with no command. Each unit's
relative_results_dir is unique by construction
(results/<arm>/<seed>) — no two units write to the same path.
run_units(units, *, runner, max_parallel) -> list[ArmUnitResult]
Runs each unit through the injected runner. Catches runner
exceptions and converts them to failed ArmUnitResults so a single
arm crashing doesn't abort the iteration. Returns results in input
order so callers can pair them deterministically.
merge_unit_results(results, *, plan) -> dict
Deterministic merge into a findings-shaped structure: arms grouped
by arm_id (sorted), arm.status="failed" when any unit failed,
units within an arm sorted by (seed, condition). Byte-equal across
repeated calls — that's the criterion the issue asks for.
failed_units(results) -> list[ArmUnit]
Helper for partial-retry: which units need re-running?
default_max_parallel() -> int
The min(CPU, 4) default the issue calls out.
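An end-to-end sketch of the partition, run and merge steps above. The plan schema (`arms` / `conditions` / `seeds` keys), the dataclass fields, and the synchronous loop are assumptions for illustration; `max_parallel` is validated but unused here, and the `plan` keyword on the merge is omitted.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmUnit:  # hypothetical shape
    arm_id: str
    condition: str
    seed: str
    command: str


@dataclass
class ArmUnitResult:  # hypothetical shape
    unit: ArmUnit
    status: str
    error: str = ""


def partition_plan(plan):
    units = []
    for arm in plan.get("arms", []):
        if not arm.get("command"):
            continue  # arms without a command are skipped
        for cond in arm.get("conditions") or [{"name": "default"}]:
            for seed in cond.get("seeds") or ["seed-1"]:
                units.append(ArmUnit(arm["id"], cond["name"], str(seed), arm["command"]))
    return units


def run_units(units, *, runner, max_parallel=1):
    if max_parallel < 1:
        raise ValueError("max_parallel must be >= 1")
    results = []
    for unit in units:  # sequential; results stay in input order
        try:
            results.append(runner(unit))
        except Exception as exc:  # one crashing arm must not abort the rest
            results.append(ArmUnitResult(unit, "failed", error=str(exc)))
    return results


def merge_unit_results(results):
    by_arm = {}
    for result in results:
        by_arm.setdefault(result.unit.arm_id, []).append(result)
    arms = []
    for arm_id in sorted(by_arm):
        units = sorted(by_arm[arm_id], key=lambda r: (r.unit.seed, r.unit.condition))
        arms.append({
            "arm_id": arm_id,
            "status": "failed" if any(u.status == "failed" for u in units) else "complete",
            "units": [{"seed": u.unit.seed, "condition": u.unit.condition,
                       "status": u.status} for u in units],
        })
    failed = sum(1 for r in results if r.status == "failed")
    return {"arms": arms, "failed_unit_count": failed, "total_unit_count": len(results)}


def fake_runner(unit):
    if unit.arm_id == "h-ablation":
        raise RuntimeError("simulated crash")
    return ArmUnitResult(unit, "complete")


plan = {"arms": [
    {"id": "h-main", "command": "python sim.py", "conditions": [{"name": "base"}]},
    {"id": "h-ablation", "command": "python sim.py --ablate",
     "conditions": [{"name": "base", "seeds": ["seed-1", "seed-2"]}]},
    {"id": "h-notes"},  # no command
]}
units = partition_plan(plan)
results = run_units(units, runner=fake_runner, max_parallel=2)
merged = merge_unit_results(results)
```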
Behavioral tests (14 in tests/test_parallel_arms.py):
partition_plan:
- single arm/condition with default seed
- multi-seed condition fans out
- multiple arms × conditions: 3 units; sorted assertion
- results_dir doesn't overlap across seeds
- arm without command skipped
run_units:
- results in input order (the determinism contract for merge)
- runner exception becomes failed unit, doesn't abort run
- max_parallel < 1 raises ValueError
merge_unit_results:
- arms grouped by arm_id, sorted
- arm.status="failed" when any unit failed
- failed_unit_count + total_unit_count correct
- byte-equal across repeated calls
- units within arm sorted by (seed, condition)
failed_units:
- returns only failed units (the partial-retry contract)
Out of scope (Phase B):
- SDKDispatcher integration: a runner that actually spawns
Agent(isolation="worktree") per unit
- anyio.gather + semaphore for real parallelism
- Wire-up into iteration.py so EXECUTE_ANALYZE picks parallel mode
when max_parallel_arms > 1
- Wall-clock measurement on a multi-arm campaign (the
"significantly less wall-clock" criterion)
Test suite (this branch, stacked on #133): 346 + 14 new = 360 passing.
Refs #120, #123. Stacked on #143 (#133) which stacks on #136 (#121).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: end-to-end isolated-runner tests for parallel arms (#123 Phase B)
Closes the SDK-integration gap from #150 (Phase A): adds three
end-to-end behavioral tests that exercise the full chain:
partition_plan -> make_isolated_arm_runner -> run_units -> merge_unit_results
The SDK side is injected via a fake (per the no-live-LLM project
principle, see CLAUDE.md). The tests assert the orchestration
contract — every unit dispatches with isolation='worktree' to a
non-overlapping results dir, failures are isolated to the affected arm,
and the merged output is deterministic.
Tests:
test_three_units_dispatched_with_isolation_kwarg
Plan with 1 arm × 1 condition + 1 arm × 1 condition × 2 seeds = 3
units. All three dispatch with isolation='worktree'. Merged output
has both arms in sorted order, both reported complete.
test_partial_failure_isolated_to_one_arm
Fake runner returns is_error for h-ablation; h-main succeeds.
Merged output: h-main complete, h-ablation failed. Failed unit
count = 2 (both ablation seeds). Total = 3. The acceptance
criterion 'one arm failure does not abort iteration'.
test_no_two_units_share_results_dir
Captures every Write-output-files-to path the runner sends to
each subagent; asserts all 3 are unique. The acceptance criterion
'no two subagents ever write to the same results/ subpath'.
A local _LocalSDKResult stand-in replaces the import from sdk_dispatch
so this branch doesn't depend on sdk_dispatch.py landing first; the
real SDKResult from #121 is duck-compatible (same field shape).
The full chain works against any sdk_runner respecting the SDKRunner
Protocol — production wiring (which constructs the real Anthropic SDK
runner with isolation kwarg) is verified on soak.
Closes #123.
* feat: make_sdk_explore_runner factory for Stage A (#132 Phase B)
Closes the SDK-integration gap from #149 (Phase A): adds
make_sdk_explore_runner(*, sdk_runner, cwd, model, max_turns) that
returns an ExploreRunner-shaped callable backed by a read-only
Explore subagent (subagent_type='Explore').
Per the no-live-LLM project principle (CLAUDE.md), the factory takes
an injected sdk_runner. Production wiring constructs the real Anthropic
SDK runner; tests inject a recording fake. Defaults model to Haiku
because read-only mapping is cheap and benefits from speed over depth;
deep synthesis happens in Stage B (the single Opus call), not Stage A.
Three new behavioral tests:
test_dispatches_each_scope_with_explore_subagent_type:
With four default scopes, the SDK runner is called four times,
each with subagent_type='Explore'. Reports carry the runner's
text + token counts; total_input_tokens aggregates correctly.
test_falls_back_when_sdk_runner_lacks_subagent_kwarg:
Older runners without subagent_type kwarg are accommodated via
TypeError fallback to the base signature. Forward/backward
compatibility across SDK API evolution.
test_uses_haiku_by_default:
Default model is Haiku (read-only mapping should be cheap).
A local _LocalSDKResult stand-in keeps this branch independent of
sdk_dispatch.py; the real SDKResult is duck-compatible.
Closes #132.
* docs: retro for the #120 Claude-Code-native uplift initiative
Closes the tracking epic with a written retrospective covering:
* what landed (15 children + the no-live-LLM guard PR)
* the architecture delta (subprocess claude -p -> Claude Agent SDK,
methodology in CLAUDE.md, parallel subagents replacing mega-sessions)
* the token-budget delta with each lever and how to verify it on soak
* how the no-structural-tests + no-live-LLM-calls discipline shaped
the design (pluggable seams everywhere)
* what's deferred to soak (criteria that genuinely need a real campaign)
* follow-up work for the next initiative
Closes #120.
* ci: add pytest workflow for push and pull_request
Adds .github/workflows/tests.yml — runs pytest on Python 3.11 + 3.12
for every push to main/reflective and every PR targeting them.
The job intentionally strips OPENAI_API_KEY / OPENAI_BASE_URL /
ANTHROPIC_API_KEY from the runner env. The no-live-LLM project
principle (CLAUDE.md + tests/conftest.py autouse guard) says tests
must never call real LLMs; this CI step is the outer line of defence,
the conftest guard the inner.
Concurrency: in-flight runs on the same PR are cancelled when a new
push lands so we don't burn CI minutes on stale commits.
Flags:
pytest -ra — surface skipped/xfailed in the log so
silent skips don't hide regressions
pytest --strict-markers — fail the build if a test references an
unknown marker. Keeps the test surface
honest.
* ci: drop pull_request base-branch filter so any PR runs CI
Long-running integration branches (e.g. tracking-N) get CI feedback
without contributors having to special-case the base branch in the
workflow.
* docs: pip install + git clone use the reflective branch (#120)
The default branch is main, but reflective is where new work lands
first. Users following the README from a fresh clone of main got an
older Nous than what's actively being developed.
Also documents the optional [sdk] extra for --agent sdk users.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 49510b3 commit 3e69b85
58 files changed
Lines changed: 7801 additions & 44 deletions
File tree
- .github/workflows
- bin
- docs
- contributing
- retros
- orchestrator
- plugin/nous
- skills
- prompts/methodology
- tests