This document describes the internal architecture of the Nous framework: what each component does, how they interact, and the design decisions behind them.
Nous separates deterministic orchestration from AI reasoning. The orchestrator is a Python state machine — it never calls an LLM. It owns phase transitions, checkpointing, gate enforcement, and artifact validation. AI agents are external processes invoked by the orchestrator with structured prompts and schema-governed outputs.
This separation exists because:
- The orchestrator must be auditable and predictable — you need to trust that gates cannot be bypassed, validation runs correctly, and state is always recoverable.
- AI agents are stochastic and expensive — isolating them makes the system testable without LLM calls and lets you swap agent implementations without touching control flow.
┌─────────────────────────────────────┐
│ Orchestrator (Python) │
│ │
│ ┌──────────┐ ┌───────────────┐ │
│ │ Engine │───▶│ state.json │ │
│ │ (states) │ │ (checkpoint) │ │
│ └────┬─────┘ └───────────────┘ │
│ │ │
│ ┌────▼─────┐ ┌───────────────┐ │
│ │ Dispatch │───▶│ Agent (LLM) │ │
│ └────┬─────┘ └───────┬───────┘ │
│ │ │ │
│ │ schema-validated │
│ │ artifacts │
│ │ │ │
│ ┌────▼─────┐ ┌──────▼────────┐ │
│ │ Gates │ │ Fast-Fail │ │
│ │ (human) │ │ (rules) │ │
│ └──────────┘ └───────────────┘ │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Campaign Directory │
│ │
│ campaign.yaml state.json │
│ ledger.json principles.json │
│ runs/iter-N/ │
│ problem.md bundle.yaml │
│ experiment_plan.yaml │
│ execution_results.json │
│ findings.json │
│ principle_updates.json │
│ gate_summary_*.json │
└─────────────────────────────────────┘
The engine owns the 7-state state machine and checkpoint/resume.
State machine:
INIT ──▶ DESIGN ──▶ HUMAN_DESIGN_GATE
▲ │
│ (reject) │ (approve)
└──────────────┘
│
▼
EXECUTE_ANALYZE ──▶ HUMAN_FINDINGS_GATE
▲ │
│ (reject) │ (approve)
└──────────────────────────────┘
│
▼
DONE
│
└──▶ DESIGN (next iteration, counter increments)
Valid transitions:
- INIT → DESIGN
- DESIGN → HUMAN_DESIGN_GATE
- HUMAN_DESIGN_GATE → EXECUTE_ANALYZE (approve) | DESIGN (reject)
- EXECUTE_ANALYZE → HUMAN_FINDINGS_GATE
- HUMAN_FINDINGS_GATE → DONE (approve) | EXECUTE_ANALYZE (reject)
- DONE → DESIGN (next iteration, increments counter)
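The transition list above can be sketched as a lookup table plus one pure function. This is a minimal illustration, not the engine's actual code: `VALID_TRANSITIONS` and `next_state` are hypothetical names, and the real engine also updates timestamps and persists `state.json`.

```python
# Hypothetical transition table mirroring the list of valid transitions above.
VALID_TRANSITIONS = {
    "INIT": {"DESIGN"},
    "DESIGN": {"HUMAN_DESIGN_GATE"},
    "HUMAN_DESIGN_GATE": {"EXECUTE_ANALYZE", "DESIGN"},
    "EXECUTE_ANALYZE": {"HUMAN_FINDINGS_GATE"},
    "HUMAN_FINDINGS_GATE": {"DONE", "EXECUTE_ANALYZE"},
    "DONE": {"DESIGN"},
}


def next_state(phase: str, iteration: int, to_state: str) -> tuple:
    """Return the new (phase, iteration), or raise on an illegal transition."""
    if to_state not in VALID_TRANSITIONS.get(phase, set()):
        raise ValueError(f"illegal transition: {phase} -> {to_state}")
    # Only DONE -> DESIGN starts a new iteration; gate rejections keep the counter.
    if phase == "DONE" and to_state == "DESIGN":
        iteration += 1
    return to_state, iteration
```

Keeping the table as plain data means the legal moves can be unit-tested without running a campaign.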
Key behaviors:
- transition(to_state) validates against the transition table, updates the timestamp, sets last_entered_phase, and atomically writes state.json. Both fields update only on phase entry; artifact writes within a phase do not refresh them (#236), so operators polling state.json for progress see entry-time values linger throughout long phases. Watch artifact mtimes for sub-second progress instead.
- The iteration counter increments only on the DONE → DESIGN transition (starting a new iteration). Loopbacks from HUMAN_DESIGN_GATE → DESIGN (reject) do NOT increment; they are revisions within the same iteration.
- The DONE state allows transition to DESIGN for the next iteration.
Atomic writes: State is written to a temporary file, fsynced, then renamed over state.json, so a crash mid-write leaves the previous valid file intact. The in-memory state is updated only after the disk write succeeds, so the in-memory and on-disk state never diverge.
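The write-fsync-rename pattern can be sketched with the standard library alone. The helper name `atomic_write_json` is hypothetical, and the real engine may differ in details such as temp-file naming.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path, payload: dict) -> None:
    """Write payload to path via temp file, fsync, then rename."""
    path = Path(path)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic swap; readers see old or new, never partial
    except BaseException:
        # On failure the previous file is untouched; remove the stray temp file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A caller would assign the new in-memory state only after this function returns, which matches the ordering described above.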
The dispatcher invokes AI agents by role and phase, passing structured input and writing schema-validated output.
Agent roles:
| Role | Invoked During | Produces |
|---|---|---|
| Planner (Opus, Claude Agent SDK) | DESIGN | problem.md, bundle.yaml, handoff_snapshot.md |
| Executor (Sonnet, Claude Agent SDK) | EXECUTE_ANALYZE | experiment_plan.yaml, findings.json, principle_updates.json, patches/, results/ |
Both agents write artifacts directly to the campaign directory (iter_dir) and run nous validate before claiming done. If validation fails, the agent reads the errors, fixes the artifacts, and retries. The orchestrator runs a post-check as a safety net.
Validation CLI (orchestrator/validate.py):
- nous validate design --dir <iter_dir>: checks problem.md, bundle.yaml (schema), handoff_snapshot.md
- nous validate execution --dir <iter_dir>: checks experiment_plan.yaml (schema), findings.json (schema), principle_updates.json, patches (when code_changes exist), and the input and output files referenced in the plan
Implementations:
- StubDispatcher (dispatch.py) produces valid, schema-conformant artifacts without calling any LLM. Used for testing the orchestrator loop.
- SDKDispatcher (sdk_dispatch.py, default and only user-facing code-access backend post-#183) calls the Claude Agent SDK (claude-agent-sdk) directly, giving agents code access and shell tools through native streaming, programmatic prompt caching, and message-level retry. Agents write files directly to iter_dir. Selected via --agent sdk (the default). Requires claude-agent-sdk and anyio, both required dependencies of nous, so pip install nous is sufficient.
- CLIDispatcher (cli_dispatch.py) is retained as a private base class that SDKDispatcher inherits from for the parse / validate / retry-with-feedback machinery. The legacy --agent api (claude -p subprocess) path was removed in #183; the class is no longer reachable from the CLI.
Dispatch interface:
dispatcher.dispatch(
    role="executor",          # which agent
    phase="execute-analyze",  # which phase
    output_path=path,         # where to write
    iteration=1,              # current iteration
)

All three dispatchers share the same interface. CLIDispatcher extends LLMDispatcher; SDKDispatcher extends CLIDispatcher and overrides only _call_claude and preflight_check.
Claude Code Stop hooks fire after every agent turn and decide whether the agent is allowed to terminate. bin/nous-execute-stop is Nous's deterministic completion check: the executor is allowed to stop only when both conditions hold on disk, no LLM judgment involved:
- principle_updates.json exists in the iteration directory.
- nous validate execution --dir $NOUS_ITER_DIR returns status: pass.
If either fails, the hook exits with code 2 and writes a structured reason to stderr; Claude Code feeds that reason back into the agent's conversation so it can fix the artifact and try again. Wire-up lives in the per-campaign .claude/settings.json (see #135) — the orchestrator exports NOUS_ITER_DIR before launching the executor session.
This is preferred over a probabilistic Haiku evaluator anywhere the success criterion is a schema check: cheaper, faster, and immune to evaluator drift.
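The two-condition check can be sketched as follows. This is not the contents of bin/nous-execute-stop: the function names are hypothetical, and the `validate` callable is an assumed stand-in for running nous validate execution and reading its status.

```python
import sys
from pathlib import Path
from typing import Callable, Optional


def stop_decision(iter_dir: Path, validate: Callable[[Path], str]) -> Optional[str]:
    """Return None if the executor may stop, otherwise a reason string."""
    if not (iter_dir / "principle_updates.json").exists():
        return "principle_updates.json is missing from the iteration directory"
    status = validate(iter_dir)  # assumed to return "pass" or another status string
    if status != "pass":
        return f"nous validate execution returned status: {status}"
    return None


def hook_exit_code(iter_dir: Path, validate: Callable[[Path], str]) -> int:
    """Exit code 2 blocks the stop; the reason goes to stderr for the agent."""
    reason = stop_decision(iter_dir, validate)
    if reason is not None:
        print(reason, file=sys.stderr)
        return 2
    return 0
```

Both checks read only from disk, so the decision is reproducible for a given iteration directory.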
SDKDispatcher (--agent sdk, the default) invokes the Claude Agent SDK for both agent roles. The legacy --agent api (claude -p subprocess) backend was removed in #183; only the SDK path is reachable from the CLI.
Pre-flight check: At campaign start, Nous validates that the SDK is importable and credentials work. Environment problems are caught in seconds, not hours into an overnight run.
All failures are retried with exponential backoff (5s → 30s → 120s → 300s → 600s). There is no permanent/transient classification — the only hard failures are CLI-not-found and repo-path-missing, which are caught before the retry loop. Configurable via --max-cli-retries (default 10) and --timeout (default 1800s).
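A retry loop over that schedule might look like the sketch below. Two points are assumptions rather than documented behavior: the delay stays at the last value (600s) once the schedule is exhausted, and the retry limit is treated as the total number of attempts. The names are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Delays in seconds, taken from the schedule described above.
BACKOFF_SECONDS = (5, 30, 120, 300, 600)


def delay_for_attempt(attempt: int) -> int:
    """Delay after the given 1-based failed attempt (assumed to clamp at 600s)."""
    index = min(attempt, len(BACKOFF_SECONDS)) - 1
    return BACKOFF_SECONDS[index]


def call_with_retries(fn: Callable[[], T], max_attempts: int = 10,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or max_attempts is reached, sleeping between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted; the caller records the failure
            sleep(delay_for_attempt(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

Injecting `sleep` keeps the loop testable without waiting out real delays.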
On timeout/max-turns retries, the prompt is enriched with a continuation note so the agent checks for existing artifacts and picks up where it left off. The experiment worktree and iter_dir artifacts are preserved across retries.
Failure persistence: Each retry event is appended to retry_log.jsonl in the campaign directory (timestamp, phase, failure_type, attempt, error).
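Appending one JSON object per line keeps the log readable even if the process dies between events. The sketch below uses the five fields listed above; the function name and the ISO-8601 UTC timestamp format are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_retry_event(campaign_dir: Path, phase: str, failure_type: str,
                       attempt: int, error: str) -> dict:
    """Append one retry record to retry_log.jsonl and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed format
        "phase": phase,
        "failure_type": failure_type,
        "attempt": attempt,
        "error": error,
    }
    with open(Path(campaign_dir) / "retry_log.jsonl", "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record
```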
Campaign-level resilience: If an iteration fails after retries are exhausted, it is recorded as FAILED in ledger.json and the campaign continues to the next iteration.
Prompts are templates in prompts/methodology/ (one per role). At dispatch time, PromptLoader renders each template by replacing {{placeholder}} markers with domain-specific context from campaign.yaml:
- {{target_system}}, {{system_description}}: from campaign.yaml
- {{observable_metrics}}, {{controllable_knobs}}: from campaign.yaml
- {{active_principles}}: formatted from principles.json
- Phase-specific context: {{bundle_yaml}}, {{findings_json}}
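Placeholder substitution of this kind can be sketched with a single regular expression. This is not PromptLoader itself: the `render` function is hypothetical, and raising on an unknown placeholder is an assumed policy.

```python
import re

# Matches {{name}} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace each {{name}} marker with context[name]."""
    def substitute(match):
        key = match.group(1)
        if key not in context:
            # Assumed behavior: fail loudly rather than send a half-rendered prompt.
            raise KeyError("no value for placeholder " + key)
        return str(context[key])

    return _PLACEHOLDER.sub(substitute, template)
```

For example, `render("Target: {{target_system}}", {"target_system": "My System"})` returns `"Target: My System"`.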
The executor agent (Sonnet, via the Claude Agent SDK) handles the entire execution pipeline in a single session:
- Receives the approved hypothesis bundle
- Explores the target repo, discovers build commands
- Produces experiment_plan.yaml with exact shell commands per arm
- Runs the commands, captures stdout/stderr per condition
- Compares observed metrics against predictions
- Produces findings.json and principle_updates.json
After execution, the orchestrator validates artifacts (schema check) and merges principles by ID into principles.json.
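A merge by ID can be sketched as a dictionary upsert. The sketch assumes each principle is a dict with an `id` key and that an update replaces only the fields it carries; the actual layout of principle_updates.json is not shown here, so treat both as assumptions.

```python
def merge_principles(existing: list, updates: list) -> list:
    """Upsert updates into existing by 'id', preserving first-seen order."""
    by_id = {principle["id"]: dict(principle) for principle in existing}
    for update in updates:
        current = by_id.get(update["id"], {})
        # Fields in the update win; untouched fields are carried over.
        by_id[update["id"]] = {**current, **update}
    return list(by_id.values())
```

Because the result is keyed by ID, applying the same update list twice yields the same store as applying it once.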
Two Claude Agent SDK calls per iteration:
| Phase | Model | Role |
|---|---|---|
| DESIGN | Opus | Planner — explores, frames, designs hypothesis bundle |
| EXECUTE_ANALYZE | Sonnet | Executor — builds, patches, runs, analyzes, extracts |
With CLIDispatcher, a campaign configuration can be as simple as:
research_question: "What drives latency in my system?"
target_system:
  name: "My System"
  description: "A service that processes requests."
  repo_path: /path/to/repo

The planner explores the codebase to discover observable metrics, controllable knobs, and execution methods. The full campaign format (with explicit metrics and knobs) remains supported; provided values take precedence over what the planner discovers.
When using CLIDispatcher, the planner can include optional code_changes in bundle arms:
arms:
  - type: h-main
    prediction: "TTFT decreases by 15-25%"
    mechanism: "SJF reorders by predicted compute cost"
    diagnostic: "Check scheduling order"
    code_changes:
      - file: scheduler/policy.go
        intent: "Replace FCFS with shortest-job-first"
        rationale: "Prefix-heavy requests have predictable cost"

The planner says what and why; the executor implements the actual changes in a git worktree.
Deterministic module that appends a schema-conformant row to ledger.json after each iteration. Reads findings.json, bundle.yaml, and principles.json to extract: h_main_result, ablation_results, control_result, robustness_result, prediction accuracy, and principle changes. No LLM calls — purely deterministic computation.
Human gates are hard stops that cannot be bypassed. They surface the artifact and review summaries, then wait for a decision.
Valid decisions:
- approve: advance to the next phase
- reject: loop back (HUMAN_DESIGN_GATE → DESIGN, HUMAN_FINDINGS_GATE → EXECUTE_ANALYZE)
- abort: end the campaign
Testing modes: auto_approve=True or auto_response="reject" for deterministic testing without human interaction.
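The decision handling can be sketched as below. The `auto_approve` and `auto_response` parameters come from the text above; the function names, the `ask` callable, and the re-prompt-on-invalid-input behavior are assumptions for illustration.

```python
from typing import Callable, Optional

VALID_DECISIONS = ("approve", "reject", "abort")

# Where each gate sends the campaign, per the decisions listed above.
NEXT_STATE = {
    ("HUMAN_DESIGN_GATE", "approve"): "EXECUTE_ANALYZE",
    ("HUMAN_DESIGN_GATE", "reject"): "DESIGN",
    ("HUMAN_FINDINGS_GATE", "approve"): "DONE",
    ("HUMAN_FINDINGS_GATE", "reject"): "EXECUTE_ANALYZE",
}


def resolve_decision(ask: Callable[[], str], auto_approve: bool = False,
                     auto_response: Optional[str] = None) -> str:
    """Return a valid decision from the testing flags or from the human."""
    if auto_approve:
        return "approve"
    if auto_response is not None:
        if auto_response not in VALID_DECISIONS:
            raise ValueError(f"invalid auto_response: {auto_response!r}")
        return auto_response
    while True:  # assumed: keep asking until the answer is valid
        answer = ask().strip().lower()
        if answer in VALID_DECISIONS:
            return answer


def next_state_after_gate(gate: str, decision: str) -> Optional[str]:
    """Return the next phase, or None when the decision is abort."""
    if decision == "abort":
        return None
    return NEXT_STATE[(gate, decision)]
```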
Where gates appear:
- HUMAN_DESIGN_GATE — after DESIGN, human sees the hypothesis bundle
- HUMAN_FINDINGS_GATE — after EXECUTE_ANALYZE, human sees findings and principle updates
Before each human gate, a formatted summary (gate_summary_*.json) is produced. The summary includes a plain-language description and bullet points highlighting what matters for the decision.
Gates display the summary first, then the raw artifact (for those who want full detail).
Spec-fidelity diff (#249 / F4). For the design-phase summary,
_augment_summary_with_spec_diff (in orchestrator/iteration.py)
post-processes the LLM-generated summary to attach a deterministic
campaign_spec_diff block: locked_parameters violations, depth_overrides
presence, declared workload changes. Always emitted, regardless of
--auto-approve. nous status surfaces it in human-readable
form.
Two pure-Python validators close the gap between self-consistency (the executor matches the bundle) and spec-fidelity (the bundle matches the campaign):
- _validate_locked_parameters (#246 / F1): every entry in campaign.locked_parameters must match bundle.experiment_spec.verified_parameters exactly. Hard-fail regardless of --auto-approve.
- _validate_locked_workload (#265 / F20): walks the canonical workload structure and diffs against bundle.inputs/*.yaml. Declared deviations (bundle.workload_changes_from_canonical) are allowed; undeclared are hard-fails.
compute_campaign_spec_diff exposes the same logic for read-only
auditor use (the F4 gate-summary diff). See
docs/campaign-authoring-guide.md for the discipline these enforce.
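The exact-match rule for locked parameters can be illustrated with a small comparison function. It assumes both `locked_parameters` and `verified_parameters` are flat name-to-value mappings, which may not match the real schemas; the function name is hypothetical.

```python
def find_locked_parameter_violations(locked: dict, verified: dict) -> list:
    """Return one message per locked parameter that is missing or differs."""
    violations = []
    for name, expected in locked.items():
        if name not in verified:
            violations.append(f"{name}: missing from bundle (campaign locks {expected!r})")
        elif verified[name] != expected:
            violations.append(
                f"{name}: bundle has {verified[name]!r}, campaign locks {expected!r}"
            )
    return violations
```

An empty list means the bundle honors every locked value; any entry would be a hard failure under the rule above.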
capture_reproducibility_metadata (#262 / F17) runs at INIT and
records target repo HEAD, dirty flag, hardware-config sha,
language versions, gpu_memory_utilization, latency-config file
paths. The block is persisted in state.json (first-capture wins;
re-running INIT preserves the original) and surfaced via nous status. Per-iteration snapshot_iter_files copies the actual
hardware/latency config files into runs/iter-N/snapshots/ so a
future reviewer can diff exact numbers even after the operator
edits the source-of-truth file.
emit_cumulative_patch (#266 / F21) runs at iteration completion,
before the experiment branch is destroyed, capturing
git diff <main>..<branch> to runs/iter-N/patches/cumulative.patch.
Future campaigns reuse it via:
derived_from:
  campaign: paper-memorytime-mirage
  iteration: 2  # or "final"

apply_derived_from_patch resolves and applies the cumulative
patch to every experiment worktree as a preflight. nous lineage <run_id> surfaces the inheritance chain.
_resolve_turn_silence_threshold(phase) (#264 / F19) walks the
resolution chain — bundle per-phase override → bundle scalar
override → campaign per-phase value → phase default
(design=600, execute_analyze=120, report=240). DESIGN's heavy
reasoning between tool calls earns a longer threshold than
EXECUTE_ANALYZE's frequent simulator calls, eliminating the active
stall observed in paper-memorytime-mirage iter-3.
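The resolution chain above amounts to a first-match lookup. In this sketch the phase defaults come from the text, while the function signature and the way the bundle and campaign values are passed in are assumptions.

```python
from typing import Mapping, Optional

# Phase defaults as listed above.
PHASE_DEFAULTS = {"design": 600, "execute_analyze": 120, "report": 240}


def resolve_turn_silence_threshold(
    phase: str,
    bundle_per_phase: Optional[Mapping[str, int]] = None,
    bundle_scalar: Optional[int] = None,
    campaign_per_phase: Optional[Mapping[str, int]] = None,
) -> int:
    """Walk the chain in priority order and return the first value found."""
    if bundle_per_phase and phase in bundle_per_phase:
        return bundle_per_phase[phase]      # 1. bundle per-phase override
    if bundle_scalar is not None:
        return bundle_scalar                # 2. bundle scalar override
    if campaign_per_phase and phase in campaign_per_phase:
        return campaign_per_phase[phase]    # 3. campaign per-phase value
    return PHASE_DEFAULTS[phase]            # 4. phase default
```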
invoke_plot_specs (#263 / F18) reads campaign.plot_specs,
invokes each user-supplied figure script with NOUS_RESULTS_DIR
and NOUS_FIGURES_DIR environment variables. nous package
tarballs work_dir + reproduce.sh + Dockerfile + README using the
F17 reproducibility metadata.
Planner (Opus)
│
▼
problem.md + bundle.yaml
│
▼
HUMAN_DESIGN_GATE (approve/reject/abort)
│
▼
Executor (Sonnet)
│
▼
experiment_plan.yaml + execution_results.json
+ findings.json + principle_updates.json
│
▼
HUMAN_FINDINGS_GATE (approve/reject/abort)
│
▼
principles.json (upsert by ID)
│
▼
DONE
Iteration 1 Iteration 2 Iteration N
┌──────────────────┐ ┌──────────────────┐ ┌──────────────┐
│ Design │ │ Design │ │ │
│ Execute │ ───▶ │ (constrained by │ ───▶ │ ... │
│ Extract │ │ principles) │ │ │
│ → 2 principles │ │ Execute │ │ │
│ │ │ Extract │ │ │
│ │ │ → 1 new, │ │ │
│ │ │ 1 updated │ │ │
└──────────────────┘ └──────────────────┘ └──────────────┘
principles.json grows and refines over time:
iter 1: [P1, P2]
iter 2: [P1, P2', P3] (P2 updated, P3 inserted)
iter 3: [P1, P2', P4] (P3 pruned, P4 inserted)
Principles are hard constraints: the Planner must not design bundles that contradict active principles without explicit justification.
run_campaign.py loops through iterations:
for i in 1..max_iterations:
┌───────────────────────────────────────────────────────────┐
│ run_iteration(iteration=i) │
│ DESIGN → HUMAN_DESIGN_GATE → EXECUTE_ANALYZE │
│ → HUMAN_FINDINGS_GATE → DONE │
└─────────────────────┬─────────────────────────────────────┘
│
(if not final)
│
append_ledger_row(i)
│
engine.transition("DESIGN")
(increments iteration counter)
│
next iteration
(principles injected into design prompt)
The deterministic ledger (orchestrator/ledger.py) appends one row per iteration with prediction accuracy and principle changes, without any LLM calls.
Every artifact exchanged between components is validated against a JSON Schema (Draft 2020-12). This ensures agents produce well-formed output and makes the system testable without LLMs.
| Schema | Format | Governs |
|---|---|---|
| campaign.schema.yaml | YAML | Campaign configuration (target system, prompt layers) |
| state.schema.json | JSON | Orchestrator checkpoint (phase, iteration, run_id, config_ref) |
| bundle.schema.yaml | YAML | Hypothesis bundles (arms with predictions, mechanisms, diagnostics) |
| experiment_plan.schema.yaml | YAML | Experiment plans (exact commands per arm/condition) |
| findings.schema.json | JSON | Prediction-vs-outcome tables with error classification |
| principles.schema.json | JSON | Principle store (statement, confidence, regime, evidence, category, status) |
| ledger.schema.json | JSON | Append-only iteration log with prediction accuracy and domain metrics |
The bundle and campaign schemas use YAML format because they contain free-text fields that are more readable in YAML. All other schemas use JSON.
Automated AI reviews (DESIGN_REVIEW, FINDINGS_REVIEW) have been removed. Quality control is now handled by:
- HUMAN_DESIGN_GATE — the human reviews the hypothesis bundle directly after DESIGN
- HUMAN_FINDINGS_GATE — the human reviews findings and principle updates after EXECUTE_ANALYZE
This removes the multi-perspective automated review overhead while keeping humans in the loop at both decision points.
When a prediction is wrong, the error type determines what the system learns:
| Error Type | Meaning | System Response |
|---|---|---|
| Direction | Mechanism is fundamentally wrong | Prune or heavily revise the principle |
| Magnitude | Right mechanism, wrong strength | Update principle with calibrated bounds |
| Regime | Works under different conditions | Update principle with correct regime boundaries |
Direction errors are the most serious and most valuable — they reveal where the causal model is fundamentally flawed. In the BLIS case study, a direction error in iteration 1 (predicting <10% degradation, observing 62.4% degradation) redirected the entire scheduling investigation toward admission control.
The orchestrator is designed for crash-safe operation:
- Atomic state writes: state.json is written to a temp file, fsynced, then renamed. A crash during write leaves the previous valid state intact.
- Checkpoint/resume: The engine loads state from state.json on construction. Kill the process at any point and restart; it resumes from the last committed state.
- Append-only ledger: ledger.json is logically append-only; rows are never modified or deleted. Implementation reads, appends, and atomically rewrites the file.
- Idempotent principle merge: The principle merge step reads the existing principles.json, upserts principles by ID, and writes back. Re-running for the same iteration produces a duplicate (detectable by ID) rather than corruption.
Nous ships with three dispatchers:
- StubDispatcher: deterministic stubs for testing.
- InlineDispatcher: emits prompts to stdout for an enclosing agent framework (no subprocess, no API key).
- SDKDispatcher: real agent calls via the Claude Agent SDK (default and only user-facing code-access path post-#183).
To create a custom dispatcher, extend LLMDispatcher. Your dispatcher must produce artifacts that pass schema validation — the orchestrator trusts the schema contract, not the content.
- Add the type to the enum in schemas/bundle.schema.yaml (arm type) and schemas/findings.schema.json (arm_type)
- Add test cases to tests/test_schemas.py