This document describes the internal architecture of the Nous framework: what each component does, how they interact, and the design decisions behind them.
Nous separates deterministic orchestration from AI reasoning. The orchestrator is a Python state machine — it never calls an LLM. It owns phase transitions, checkpointing, gate enforcement, and artifact validation. AI agents are external processes invoked by the orchestrator with structured prompts and schema-governed outputs.
This separation exists because:
- The orchestrator must be auditable and predictable — you need to trust that gates cannot be bypassed, validation runs correctly, and state is always recoverable.
- AI agents are stochastic and expensive — isolating them makes the system testable without LLM calls and lets you swap agent implementations without touching control flow.
┌─────────────────────────────────────┐
│ Orchestrator (Python) │
│ │
│ ┌──────────┐ ┌───────────────┐ │
│ │ Engine │───▶│ state.json │ │
│ │ (states) │ │ (checkpoint) │ │
│ └────┬─────┘ └───────────────┘ │
│ │ │
│ ┌────▼─────┐ ┌───────────────┐ │
│ │ Dispatch │───▶│ Agent (LLM) │ │
│ └────┬─────┘ └───────┬───────┘ │
│ │ │ │
│ │ schema-validated │
│ │ artifacts │
│ │ │ │
│ ┌────▼─────┐ ┌──────▼────────┐ │
│ │ Gates │ │ Fast-Fail │ │
│ │ (human) │ │ (rules) │ │
│ └──────────┘ └───────────────┘ │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Campaign Directory │
│ │
│ campaign.yaml state.json │
│ ledger.json principles.json │
│ runs/iter-N/ │
│ problem.md bundle.yaml │
│ experiment_plan.yaml │
│ execution_results.json │
│ findings.json │
│ principle_updates.json │
│ gate_summary_*.json │
└─────────────────────────────────────┘
The engine owns the 7-state state machine and checkpoint/resume.
State machine:
INIT ──▶ DESIGN ──▶ HUMAN_DESIGN_GATE
           ▲              │
           │ (reject)     │ (approve)
           └──────────────┘
                          │
                          ▼
                   EXECUTE_ANALYZE ──▶ HUMAN_FINDINGS_GATE
                          ▲                     │
                          │ (reject)            │ (approve)
                          └─────────────────────┘
                                                │
                                                ▼
                                              DONE
                                                │
                                                └──▶ DESIGN (next iteration, counter increments)
Valid transitions:
- INIT → DESIGN
- DESIGN → HUMAN_DESIGN_GATE
- HUMAN_DESIGN_GATE → EXECUTE_ANALYZE (approve) | DESIGN (reject)
- EXECUTE_ANALYZE → HUMAN_FINDINGS_GATE
- HUMAN_FINDINGS_GATE → DONE (approve) | EXECUTE_ANALYZE (reject)
- DONE → DESIGN (next iteration, increments counter)
Key behaviors:
- `transition(to_state)` validates against the transition table, updates the timestamp, and atomically writes `state.json`.
- The iteration counter increments only on the DONE → DESIGN transition (starting a new iteration). Loopbacks from HUMAN_DESIGN_GATE → DESIGN (reject) do not increment it; they are revisions within the same iteration.
- The DONE state allows transition to DESIGN for the next iteration.
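
The transition rules above are small enough to express as a lookup table. The following is a minimal sketch, not the engine's actual code: `SketchEngine` and `VALID_TRANSITIONS` are hypothetical names, the iteration counter is assumed to start at 1, and persistence to `state.json` is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Transition table copied from the "Valid transitions" list.
VALID_TRANSITIONS = {
    "INIT": {"DESIGN"},
    "DESIGN": {"HUMAN_DESIGN_GATE"},
    "HUMAN_DESIGN_GATE": {"EXECUTE_ANALYZE", "DESIGN"},
    "EXECUTE_ANALYZE": {"HUMAN_FINDINGS_GATE"},
    "HUMAN_FINDINGS_GATE": {"DONE", "EXECUTE_ANALYZE"},
    "DONE": {"DESIGN"},
}


@dataclass
class SketchEngine:
    """Hypothetical stand-in for the engine; disk persistence is omitted."""

    phase: str = "INIT"
    iteration: int = 1  # assumed starting value
    timestamp: str = ""

    def transition(self, to_state: str) -> None:
        if to_state not in VALID_TRANSITIONS.get(self.phase, set()):
            raise ValueError(f"invalid transition: {self.phase} -> {to_state}")
        # Only DONE -> DESIGN starts a new iteration; a gate rejection does not.
        if self.phase == "DONE" and to_state == "DESIGN":
            self.iteration += 1
        self.phase = to_state
        self.timestamp = datetime.now(timezone.utc).isoformat()
```

A rejected design loops back to DESIGN with the counter unchanged, while completing an iteration and re-entering DESIGN from DONE bumps it by one.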
Atomic writes: State is written to a temporary file, fsynced, then renamed over `state.json`, so a crash mid-write leaves the previous valid file in place. The in-memory state is updated only after the disk write succeeds, so it never runs ahead of what is on disk.
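
The write-then-rename pattern can be sketched with the standard library alone. This is an illustration of the technique under stated assumptions, not the orchestrator's code: `atomic_write_json` is a hypothetical helper name, and the temp file is assumed to live next to the target so the rename stays on one filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data: dict) -> None:
    """Write data so that a crash leaves either the old file or the new one."""
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic rename over the target
    except BaseException:
        # The previous file is untouched; only the temp file needs cleanup.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A caller that wants the "memory follows disk" guarantee would assign the new state to its in-memory field only after this function returns without raising.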
The dispatcher invokes AI agents by role and phase, passing structured input and writing schema-validated output.
Agent roles:
| Role | Invoked During | Produces |
|---|---|---|
| Planner (Opus, `claude -p`) | DESIGN | problem.md, bundle.yaml, handoff_snapshot.md |
| Executor (Sonnet, `claude -p`) | EXECUTE_ANALYZE | experiment_plan.yaml, findings.json, principle_updates.json, patches/, results/ |
Both agents write artifacts directly to the campaign directory (iter_dir) and run nous validate before claiming done. If validation fails, the agent reads the errors, fixes the artifacts, and retries. The orchestrator runs a post-check as a safety net.
Validation CLI (orchestrator/validate.py):
- `nous validate design --dir <iter_dir>`: checks problem.md, bundle.yaml (schema), handoff_snapshot.md
- `nous validate execution --dir <iter_dir>`: checks experiment_plan.yaml (schema), findings.json (schema), principle_updates.json, patches (when code_changes exist), and the input and output files referenced in the plan
Implementations:
- `StubDispatcher` (`dispatch.py`) produces valid, schema-conformant artifacts without calling any LLM. Used for testing the orchestrator loop.
- `CLIDispatcher` (`cli_dispatch.py`) invokes `claude -p` as a subprocess, giving agents code access and shell tools. Agents write files directly to `iter_dir`. Supports the `override_cwd()` context manager for pointing the executor at a git worktree.
Dispatch interface:
dispatcher.dispatch(
    role="executor",          # which agent
    phase="execute-analyze",  # which phase
    output_path=path,         # where to write
    iteration=1,              # current iteration
)

Both dispatchers share the same interface: `CLIDispatcher` extends `LLMDispatcher`.
CLIDispatcher invokes claude -p for both agent roles.
Prompts are templates in `prompts/methodology/` (one per role). At dispatch time, `PromptLoader` renders each template by replacing `{{placeholder}}` markers with domain-specific context:
- `{{target_system}}`, `{{system_description}}`: from `campaign.yaml`
- `{{observable_metrics}}`, `{{controllable_knobs}}`: from `campaign.yaml`
- `{{active_principles}}`: formatted from `principles.json`
- Phase-specific context: `{{bundle_yaml}}`, `{{findings_json}}`
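
Placeholder substitution of this kind can be sketched in a few lines. The function below is a hypothetical illustration, not `PromptLoader` itself; raising on a missing key is an assumption made here so that an unfilled placeholder never reaches an agent silently.

```python
import re

# Matches markers such as {{target_system}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_prompt(template: str, context: dict[str, str]) -> str:
    """Replace every {{name}} marker; fail loudly when a value is missing."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in context:
            raise KeyError(f"no value for placeholder {{{{{key}}}}}")
        return context[key]

    return PLACEHOLDER.sub(substitute, template)


prompt = render_prompt(
    "Target: {{target_system}}. Knobs: {{controllable_knobs}}",
    {"target_system": "My System", "controllable_knobs": "batch_size"},
)
```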
The executor agent (Sonnet, claude -p) handles the entire execution pipeline in a single session:
- Receives the approved hypothesis bundle
- Explores the target repo, discovers build commands
- Produces `experiment_plan.yaml` with exact shell commands per arm
- Runs the commands, captures stdout/stderr per condition
- Compares observed metrics against predictions
- Produces `findings.json` and `principle_updates.json`
After execution, the orchestrator validates artifacts (schema check) and merges principles by ID into principles.json.
Two claude -p calls per iteration:
| Phase | Model | Role |
|---|---|---|
| DESIGN | Opus | Planner — explores, frames, designs hypothesis bundle |
| EXECUTE_ANALYZE | Sonnet | Executor — builds, patches, runs, analyzes, extracts |
With CLIDispatcher, a campaign configuration can be as simple as:
research_question: "What drives latency in my system?"
target_system:
name: "My System"
description: "A service that processes requests."
repo_path: /path/to/repo

The planner explores the codebase to discover observable metrics, controllable knobs, and execution methods. The full campaign format (with explicit metrics and knobs) remains supported; provided values take precedence over what the planner discovers.
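
The precedence rule can be pictured as a dictionary merge. This is a hedged sketch rather than the framework's logic: `resolve_campaign_fields` is a hypothetical helper, and treating an empty or missing provided value as "not provided" is an assumption.

```python
def resolve_campaign_fields(provided: dict, discovered: dict) -> dict:
    """Explicit campaign.yaml values win; discovered values fill the gaps."""
    resolved = dict(discovered)
    for key, value in provided.items():
        # Assumption: empty values count as omitted, so they do not mask a discovery.
        if value not in (None, "", [], {}):
            resolved[key] = value
    return resolved


fields = resolve_campaign_fields(
    provided={"observable_metrics": ["p99_latency"], "controllable_knobs": []},
    discovered={"observable_metrics": ["ttft"], "controllable_knobs": ["batch_size"]},
)
```

Here the explicitly listed metric replaces the discovered one, while the empty knob list falls back to what the planner found.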
When using CLIDispatcher, the planner can include optional code_changes in bundle arms:
arms:
  - type: h-main
    prediction: "TTFT decreases by 15-25%"
    mechanism: "SJF reorders by predicted compute cost"
    diagnostic: "Check scheduling order"
    code_changes:
      - file: scheduler/policy.go
        intent: "Replace FCFS with shortest-job-first"
        rationale: "Prefix-heavy requests have predictable cost"

The planner says what and why; the executor implements the actual changes in a git worktree.
Deterministic module that appends a schema-conformant row to ledger.json after each iteration. Reads findings.json, bundle.yaml, and principles.json to extract: h_main_result, ablation_results, control_result, robustness_result, prediction accuracy, and principle changes. No LLM calls — purely deterministic computation.
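
The append step itself can be sketched as read, append, and atomic rewrite. The helper below is illustrative only: `append_row_sketch` is a hypothetical name, `ledger.json` is assumed to hold a top-level JSON list, and the row contents in the usage are placeholders rather than the real schema.

```python
import json
import os
from pathlib import Path


def append_row_sketch(ledger_path: Path, row: dict) -> list[dict]:
    """Append one row; existing rows are copied through unchanged."""
    rows = json.loads(ledger_path.read_text()) if ledger_path.exists() else []
    rows.append(row)
    # Rewrite via a temp file and rename so a crash cannot truncate the ledger.
    tmp_path = ledger_path.with_name(ledger_path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as tmp:
        json.dump(rows, tmp, indent=2)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, ledger_path)
    return rows
```

Calling it once per iteration with a dictionary such as `{"iteration": 1, "h_main_result": "refuted"}` yields a file whose earlier rows are never edited, only extended.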
Human gates are hard stops that cannot be bypassed. They surface the artifact and review summaries, then wait for a decision.
Valid decisions:
- `approve`: advance to the next phase
- `reject`: loop back (HUMAN_DESIGN_GATE → DESIGN, HUMAN_FINDINGS_GATE → EXECUTE_ANALYZE)
- `abort`: end the campaign
Testing modes: auto_approve=True or auto_response="reject" for deterministic testing without human interaction.
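
One way to picture how a decision maps to the next state, including the two testing modes, is the sketch below. It is an assumption-laden illustration: `resolve_gate`, `NEXT_STATE`, and `CampaignAborted` are hypothetical names, and signalling an abort with an exception is a choice made here, not something the framework is known to do.

```python
VALID_DECISIONS = ("approve", "reject", "abort")

NEXT_STATE = {
    ("HUMAN_DESIGN_GATE", "approve"): "EXECUTE_ANALYZE",
    ("HUMAN_DESIGN_GATE", "reject"): "DESIGN",
    ("HUMAN_FINDINGS_GATE", "approve"): "DONE",
    ("HUMAN_FINDINGS_GATE", "reject"): "EXECUTE_ANALYZE",
}


class CampaignAborted(Exception):
    """Raised when the reviewer chooses abort (hypothetical signalling)."""


def resolve_gate(gate, *, auto_approve=False, auto_response=None, ask=input):
    """Return the next state for a gate decision, or raise on abort."""
    if auto_approve:
        decision = "approve"
    elif auto_response is not None:
        decision = auto_response
    else:
        decision = ""
        # Keep asking until the human gives one of the three valid answers.
        while decision not in VALID_DECISIONS:
            decision = ask(f"{gate} [approve/reject/abort]: ").strip().lower()
    if decision not in VALID_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    if decision == "abort":
        raise CampaignAborted(gate)
    return NEXT_STATE[(gate, decision)]
```

There is no code path that returns a next state without a decision, which is the property that makes the gate a hard stop.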
Where gates appear:
- HUMAN_DESIGN_GATE — after DESIGN, human sees the hypothesis bundle
- HUMAN_FINDINGS_GATE — after EXECUTE_ANALYZE, human sees findings and principle updates
Before each human gate, a formatted summary (gate_summary_*.json) is produced. The summary includes a plain-language description and bullet points highlighting what matters for the decision.
Gates display the summary first, then the raw artifact (for those who want full detail).
Planner (Opus)
│
▼
problem.md + bundle.yaml
│
▼
HUMAN_DESIGN_GATE (approve/reject/abort)
│
▼
Executor (Sonnet)
│
▼
experiment_plan.yaml + execution_results.json
+ findings.json + principle_updates.json
│
▼
HUMAN_FINDINGS_GATE (approve/reject/abort)
│
▼
principles.json (upsert by ID)
│
▼
DONE
Iteration 1 Iteration 2 Iteration N
┌──────────────────┐ ┌──────────────────┐ ┌──────────────┐
│ Design │ │ Design │ │ │
│ Execute │ ───▶ │ (constrained by │ ───▶ │ ... │
│ Extract │ │ principles) │ │ │
│ → 2 principles │ │ Execute │ │ │
│ │ │ Extract │ │ │
│ │ │ → 1 new, │ │ │
│ │ │ 1 updated │ │ │
└──────────────────┘ └──────────────────┘ └──────────────┘
principles.json grows and refines over time:
iter 1: [P1, P2]
iter 2: [P1, P2', P3] (P2 updated, P3 inserted)
iter 3: [P1, P2', P4] (P3 pruned, P4 inserted)
Principles are hard constraints: the Planner must not design bundles that contradict active principles without explicit justification.
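
The update and insert cases in the example above amount to an upsert keyed by principle ID. A minimal sketch follows, assuming each principle is a dictionary with an `id` key and that an update carries only the fields it changes; `merge_principles` is a hypothetical name, the field values are placeholders, and pruning is not modelled.

```python
def merge_principles(store: list[dict], updates: list[dict]) -> list[dict]:
    """Upsert updates into the store by ID, preserving first-seen order."""
    by_id = {principle["id"]: principle for principle in store}
    for update in updates:
        existing = by_id.get(update["id"], {})
        # Build a new dict so the caller's store entries are not mutated.
        by_id[update["id"]] = {**existing, **update}
    return list(by_id.values())


iter1 = [
    {"id": "P1", "statement": "a", "confidence": "high"},
    {"id": "P2", "statement": "b", "confidence": "low"},
]
updates = [
    {"id": "P2", "confidence": "medium"},
    {"id": "P3", "statement": "c", "confidence": "low"},
]
iter2 = merge_principles(iter1, updates)
```

Because entries are keyed by ID, applying the same updates to `iter2` again returns an identical list.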
run_campaign.py loops through iterations:
for i in 1..max_iterations:
┌───────────────────────────────────────────────────────────┐
│ run_iteration(iteration=i) │
│ DESIGN → HUMAN_DESIGN_GATE → EXECUTE_ANALYZE │
│ → HUMAN_FINDINGS_GATE → DONE │
└─────────────────────┬─────────────────────────────────────┘
│
(if not final)
│
append_ledger_row(i)
│
engine.transition("DESIGN")
(increments iteration counter)
│
next iteration
(principles injected into design prompt)
The deterministic ledger (orchestrator/ledger.py) appends one row per iteration with prediction accuracy and principle changes, without any LLM calls.
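
In Python terms, the loop in the diagram might look like the sketch below. The three callables are injected so the control flow can be exercised without agents; `run_campaign_sketch` and its parameters are hypothetical, and appending a ledger row after every iteration (including the last) is an assumption based on the "one row per iteration" statement.

```python
from typing import Callable


def run_campaign_sketch(
    max_iterations: int,
    run_iteration: Callable[[int], None],
    append_ledger_row: Callable[[int], None],
    transition: Callable[[str], None],
) -> None:
    """Drive iterations 1..max_iterations in order."""
    for i in range(1, max_iterations + 1):
        run_iteration(i)      # DESIGN through DONE, including both gates
        append_ledger_row(i)  # deterministic, no LLM call
        if i < max_iterations:
            # DONE -> DESIGN is the transition that increments the counter.
            transition("DESIGN")


calls: list[str] = []
run_campaign_sketch(
    2,
    run_iteration=lambda i: calls.append(f"run {i}"),
    append_ledger_row=lambda i: calls.append(f"ledger {i}"),
    transition=lambda state: calls.append(f"to {state}"),
)
```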
Every artifact exchanged between components is validated against a JSON Schema (Draft 2020-12). This ensures agents produce well-formed output and makes the system testable without LLMs.
| Schema | Format | Governs |
|---|---|---|
| `campaign.schema.yaml` | YAML | Campaign configuration (target system, prompt layers) |
| `state.schema.json` | JSON | Orchestrator checkpoint (phase, iteration, run_id, config_ref) |
| `bundle.schema.yaml` | YAML | Hypothesis bundles (arms with predictions, mechanisms, diagnostics) |
| `experiment_plan.schema.yaml` | YAML | Experiment plans (exact commands per arm/condition) |
| `findings.schema.json` | JSON | Prediction-vs-outcome tables with error classification |
| `principles.schema.json` | JSON | Principle store (statement, confidence, regime, evidence, category, status) |
| `ledger.schema.json` | JSON | Append-only iteration log with prediction accuracy and domain metrics |
The bundle and campaign schemas use YAML format because they contain free-text fields that are more readable in YAML. All other schemas use JSON.
Automated AI reviews (DESIGN_REVIEW, FINDINGS_REVIEW) have been removed. Quality control is now handled by:
- HUMAN_DESIGN_GATE — the human reviews the hypothesis bundle directly after DESIGN
- HUMAN_FINDINGS_GATE — the human reviews findings and principle updates after EXECUTE_ANALYZE
This removes the multi-perspective automated review overhead while keeping humans in the loop at both decision points.
When a prediction is wrong, the error type determines what the system learns:
| Error Type | Meaning | System Response |
|---|---|---|
| Direction | Mechanism is fundamentally wrong | Prune or heavily revise the principle |
| Magnitude | Right mechanism, wrong strength | Update principle with calibrated bounds |
| Regime | Works under different conditions | Update principle with correct regime boundaries |
Direction errors are the most serious and most valuable — they reveal where the causal model is fundamentally flawed. In the BLIS case study, a direction error in iteration 1 (predicting <10% degradation, observing 62.4% degradation) redirected the entire scheduling investigation toward admission control.
The orchestrator is designed for crash-safe operation:
- Atomic state writes: `state.json` is written to a temp file, fsynced, then renamed. A crash during write leaves the previous valid state intact.
- Checkpoint/resume: The engine loads state from `state.json` on construction. Kill the process at any point and restart; it resumes from the last committed state.
- Append-only ledger: `ledger.json` is logically append-only; rows are never modified or deleted. The implementation reads, appends, and atomically rewrites the file.
- Idempotent principle merge: The principle merge step reads the existing `principles.json`, upserts principles by ID, and writes back. Re-running for the same iteration produces a duplicate (detectable by ID) rather than corruption.
Nous ships with two dispatchers:
- `StubDispatcher`: deterministic stubs for testing
- `CLIDispatcher`: real agent calls via `claude -p`
To create a custom dispatcher, extend LLMDispatcher. Your dispatcher must produce artifacts that pass schema validation — the orchestrator trusts the schema contract, not the content.
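
As a rough sketch of what a custom dispatcher might look like, the example below defines a stand-in base class, since the real `LLMDispatcher` signature is not shown in this document. `LLMDispatcherSketch` and `CannedDispatcher` are hypothetical; only the four keyword arguments are taken from the dispatch interface shown earlier.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class LLMDispatcherSketch(ABC):
    """Hypothetical stand-in for the framework's LLMDispatcher base class."""

    @abstractmethod
    def dispatch(self, role: str, phase: str, output_path: Path, iteration: int) -> None:
        ...


class CannedDispatcher(LLMDispatcherSketch):
    """Writes a pre-built artifact for each (role, phase) pair."""

    def __init__(self, canned: dict[tuple[str, str], dict]):
        self.canned = canned

    def dispatch(self, role: str, phase: str, output_path: Path, iteration: int) -> None:
        artifact = self.canned[(role, phase)]
        output_path.parent.mkdir(parents=True, exist_ok=True)
        # The artifact must already satisfy the schema for this phase.
        output_path.write_text(json.dumps(artifact, indent=2), encoding="utf-8")
```

Whatever produces the artifact, the contract is the same: the file at `output_path` has to pass schema validation, because that is all the orchestrator checks.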
To add a new arm type:
- Add the type to the `enum` in `schemas/bundle.schema.yaml` (arm type) and `schemas/findings.schema.json` (`arm_type`)
- Add test cases to `tests/test_schemas.py`