Architecture

This document describes the internal architecture of the Nous framework: what each component does, how they interact, and the design decisions behind them.

Design Philosophy

Nous separates deterministic orchestration from AI reasoning. The orchestrator is a Python state machine — it never calls an LLM. It owns phase transitions, checkpointing, gate enforcement, and artifact validation. AI agents are external processes invoked by the orchestrator with structured prompts and schema-governed outputs.

This separation exists because:

  • The orchestrator must be auditable and predictable — you need to trust that gates cannot be bypassed, validation runs correctly, and state is always recoverable.
  • AI agents are stochastic and expensive — isolating them makes the system testable without LLM calls and lets you swap agent implementations without touching control flow.

System Overview

                    ┌─────────────────────────────────────┐
                    │          Orchestrator (Python)       │
                    │                                      │
                    │  ┌──────────┐    ┌───────────────┐  │
                    │  │  Engine   │───▶│  state.json   │  │
                    │  │ (states)  │    │  (checkpoint)  │  │
                    │  └────┬─────┘    └───────────────┘  │
                    │       │                              │
                    │  ┌────▼─────┐    ┌───────────────┐  │
                    │  │ Dispatch │───▶│  Agent (LLM)  │  │
                    │  └────┬─────┘    └───────┬───────┘  │
                    │       │                  │           │
                    │       │          schema-validated    │
                    │       │            artifacts         │
                    │       │                  │           │
                    │  ┌────▼─────┐    ┌──────▼────────┐  │
                    │  │  Gates   │    │  Fast-Fail    │  │
                    │  │ (human)  │    │  (rules)      │  │
                    │  └──────────┘    └───────────────┘  │
                    └─────────────────────────────────────┘

                    ┌─────────────────────────────────────┐
                    │           Campaign Directory         │
                    │                                      │
                    │  campaign.yaml   state.json          │
                    │  ledger.json     principles.json     │
                    │  runs/iter-N/                        │
                    │    problem.md    bundle.yaml          │
                    │    experiment_plan.yaml               │
                    │    execution_results.json              │
                    │    findings.json                      │
                    │    principle_updates.json             │
                    │    gate_summary_*.json                │
                    └─────────────────────────────────────┘

Components

Engine (orchestrator/engine.py)

The engine owns the 7-state state machine and checkpoint/resume.

State machine:

INIT ──▶ DESIGN ──▶ HUMAN_DESIGN_GATE
            ▲              │
            │ (reject)     │ (approve)
            └──────────────┘
                           │
                           ▼
                    EXECUTE_ANALYZE ──▶ HUMAN_FINDINGS_GATE
                           ▲                              │
                           │ (reject)                     │ (approve)
                           └──────────────────────────────┘
                                                          │
                                                          ▼
                                                        DONE
                                                          │
                                                          └──▶ DESIGN (next iteration, counter increments)

Valid transitions:

  • INIT → DESIGN
  • DESIGN → HUMAN_DESIGN_GATE
  • HUMAN_DESIGN_GATE → EXECUTE_ANALYZE (approve) | DESIGN (reject)
  • EXECUTE_ANALYZE → HUMAN_FINDINGS_GATE
  • HUMAN_FINDINGS_GATE → DONE (approve) | EXECUTE_ANALYZE (reject)
  • DONE → DESIGN (next iteration, increments counter)

Key behaviors:

  • transition(to_state) validates the move against the transition table, updates the timestamp, sets last_entered_phase, and atomically writes state.json. The timestamp and last_entered_phase change only on phase entry; artifact writes within a phase do not refresh them (#236), so operators polling state.json will see entry-time values persist throughout long phases. To track progress within a phase, watch artifact mtimes instead.
  • Iteration counter increments only on the DONE → DESIGN transition (starting a new iteration). Loopbacks from HUMAN_DESIGN_GATE → DESIGN (reject) do NOT increment — they are revisions within the same iteration.
  • The DONE state allows transition to DESIGN for the next iteration.
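The transition table and the iteration-counter rule can be sketched in a few lines. This is a minimal illustration, not the code in orchestrator/engine.py: the class name SketchEngine is hypothetical, the counter is assumed to start at 1, and checkpointing is left out.

```python
from dataclasses import dataclass

# Transition table copied from the "Valid transitions" list above.
TRANSITIONS = {
    "INIT": {"DESIGN"},
    "DESIGN": {"HUMAN_DESIGN_GATE"},
    "HUMAN_DESIGN_GATE": {"EXECUTE_ANALYZE", "DESIGN"},
    "EXECUTE_ANALYZE": {"HUMAN_FINDINGS_GATE"},
    "HUMAN_FINDINGS_GATE": {"DONE", "EXECUTE_ANALYZE"},
    "DONE": {"DESIGN"},
}


@dataclass
class SketchEngine:
    """Hypothetical stand-in for the engine; persistence is omitted."""

    phase: str = "INIT"
    iteration: int = 1  # assumed starting value

    def transition(self, to_state: str) -> None:
        if to_state not in TRANSITIONS.get(self.phase, set()):
            raise ValueError(f"invalid transition: {self.phase} -> {to_state}")
        # Only DONE -> DESIGN starts a new iteration. A gate rejection that
        # loops back to DESIGN is a revision within the same iteration.
        if self.phase == "DONE" and to_state == "DESIGN":
            self.iteration += 1
        self.phase = to_state
```

Keeping the table as plain data means an invalid move fails loudly before any state changes.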

Atomic writes: State is written to a temporary file, fsynced, then renamed over state.json, so a crash mid-write leaves the previous state.json intact instead of a truncated file. The in-memory state is updated only after the disk write succeeds, so the in-memory and on-disk copies never diverge.
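The write-fsync-rename pattern described above looks roughly like this in Python. The function name write_state_atomically is an assumption for illustration; the engine's actual helper may differ.

```python
import json
import os
import tempfile


def write_state_atomically(path: str, state: dict) -> None:
    """Write `state` as JSON so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file; the previous state file is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A caller would update its in-memory copy only after this function returns without raising.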

Dispatch (orchestrator/dispatch.py)

The dispatcher invokes AI agents by role and phase, passing structured input and writing schema-validated output.

Agent roles:

| Role | Invoked During | Produces |
|------|----------------|----------|
| Planner (Opus, Claude Agent SDK) | DESIGN | problem.md, bundle.yaml, handoff_snapshot.md |
| Executor (Sonnet, Claude Agent SDK) | EXECUTE_ANALYZE | experiment_plan.yaml, findings.json, principle_updates.json, patches/, results/ |

Both agents write artifacts directly to the campaign directory (iter_dir) and run nous validate before claiming done. If validation fails, the agent reads the errors, fixes the artifacts, and retries. The orchestrator runs a post-check as a safety net.

Validation CLI (orchestrator/validate.py):

  • nous validate design --dir <iter_dir> — checks problem.md, bundle.yaml (schema), handoff_snapshot.md
  • nous validate execution --dir <iter_dir> — checks experiment_plan.yaml (schema), findings.json (schema), principle_updates.json, patches (when code_changes exist), input and output files referenced in plan

Implementations:

  • StubDispatcher (dispatch.py) produces valid, schema-conformant artifacts without calling any LLM. Used for testing the orchestrator loop.
  • SDKDispatcher (sdk_dispatch.py, default and only user-facing code-access backend post-#183) calls the Claude Agent SDK (claude-agent-sdk) directly, giving agents code access and shell tools through native streaming, programmatic prompt caching, and message-level retry. Agents write files directly to iter_dir. Selected via --agent sdk (the default). Requires claude-agent-sdk and anyio; both are required dependencies of nous, so pip install nous is sufficient.
  • CLIDispatcher (cli_dispatch.py) is retained as a private base class that SDKDispatcher inherits from for the parse / validate / retry-with-feedback machinery. The legacy --agent api (claude -p subprocess) path was removed in #183; the class is no longer reachable from the CLI.

Dispatch interface:

dispatcher.dispatch(
    role="executor",           # which agent
    phase="execute-analyze",   # which phase
    output_path=path,          # where to write
    iteration=1,               # current iteration
)

All three dispatchers share the same interface. CLIDispatcher extends LLMDispatcher; SDKDispatcher extends CLIDispatcher and overrides only _call_claude and preflight_check.

Stop Hook (bin/nous-execute-stop)

Claude Code Stop hooks fire after every agent turn and decide whether the agent is allowed to terminate. bin/nous-execute-stop is Nous's deterministic completion check: the executor is allowed to stop only when both conditions hold on disk, no LLM judgment involved:

  1. principle_updates.json exists in the iteration directory.
  2. nous validate execution --dir $NOUS_ITER_DIR returns status: pass.

If either fails, the hook exits with code 2 and writes a structured reason to stderr; Claude Code feeds that reason back into the agent's conversation so it can fix the artifact and try again. Wire-up lives in the per-campaign .claude/settings.json (see #135) — the orchestrator exports NOUS_ITER_DIR before launching the executor session.
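A deterministic completion check of this shape might look like the following sketch. It is not the contents of bin/nous-execute-stop: the function names are hypothetical, and it assumes nous validate prints a JSON report with a "status" key, which this document does not confirm.

```python
import json
import os
import subprocess
import sys
from pathlib import Path
from typing import Callable, Tuple


def run_validator(iter_dir: str) -> bool:
    """Run the validation CLI. ASSUMES JSON output containing a "status" key."""
    try:
        proc = subprocess.run(
            ["nous", "validate", "execution", "--dir", iter_dir],
            capture_output=True, text=True, check=False,
        )
        report = json.loads(proc.stdout)
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(report, dict) and report.get("status") == "pass"


def stop_check(
    iter_dir: str, validator: Callable[[str], bool] = run_validator
) -> Tuple[int, str]:
    """Return (exit_code, reason): 0 lets the agent stop, 2 blocks it."""
    if not (Path(iter_dir) / "principle_updates.json").is_file():
        return 2, "principle_updates.json is missing from the iteration directory"
    if not validator(iter_dir):
        return 2, "nous validate execution did not report status: pass"
    return 0, ""


def main() -> int:
    code, reason = stop_check(os.environ["NOUS_ITER_DIR"])
    if reason:
        print(reason, file=sys.stderr)  # the reason is fed back to the agent
    return code
```

A real hook script would end with sys.exit(main()). Passing the validator as a parameter keeps the check testable without the CLI installed.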

This is preferred over a probabilistic Haiku evaluator anywhere the success criterion is a schema check: cheaper, faster, and immune to evaluator drift.

SDK Dispatch

SDKDispatcher (--agent sdk, the default) invokes the Claude Agent SDK for both agent roles. The legacy --agent api (claude -p subprocess) backend was removed in #183; only the SDK path is reachable from the CLI.

Retry and Resilience

Pre-flight check: At campaign start, Nous validates that the SDK is importable and credentials work. Environment problems are caught in seconds, not hours into an overnight run.

All failures are retried on an escalating backoff schedule (5s → 30s → 120s → 300s → 600s). Failures are not classified as permanent or transient; the only hard failures are CLI-not-found and repo-path-missing, which are caught before the retry loop. Configurable via --max-cli-retries (default 10) and --timeout (default 1800s).

On timeout/max-turns retries, the prompt is enriched with a continuation note so the agent checks for existing artifacts and picks up where it left off. The experiment worktree and iter_dir artifacts are preserved across retries.

Failure persistence: Each retry event is appended to retry_log.jsonl in the campaign directory (timestamp, phase, failure_type, attempt, error).
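The backoff schedule and the retry log can be sketched together. The helper names are hypothetical. Because the default retry count (10) exceeds the five listed delays, the sketch assumes the final delay repeats; the document does not say what happens past the fifth retry.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

BACKOFF_SECONDS = (5, 30, 120, 300, 600)  # schedule quoted in the text


def backoff_delay(attempt: int) -> int:
    """Delay before retry `attempt` (1-based). ASSUMES the last value repeats."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return BACKOFF_SECONDS[min(attempt, len(BACKOFF_SECONDS)) - 1]


def log_retry(campaign_dir, phase: str, failure_type: str, attempt: int, error: str) -> dict:
    """Append one retry event as a JSON line to retry_log.jsonl."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "phase": phase,
        "failure_type": failure_type,
        "attempt": attempt,
        "error": error,
    }
    with open(Path(campaign_dir) / "retry_log.jsonl", "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")
    return event
```

One JSON object per line keeps the log appendable without re-reading it and easy to filter after an overnight run.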

Campaign-level resilience: If an iteration fails after retries are exhausted, it is recorded as FAILED in ledger.json and the campaign continues to the next iteration.

Prompt System

Prompts are templates in prompts/methodology/ (one per role). At dispatch time, PromptLoader renders each template by replacing {{placeholder}} markers with domain-specific context from campaign.yaml:

  • {{target_system}}, {{system_description}} — from campaign.yaml
  • {{observable_metrics}}, {{controllable_knobs}} — from campaign.yaml
  • {{active_principles}} — formatted from principles.json
  • Phase-specific context: {{bundle_yaml}}, {{findings_json}}
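Placeholder substitution of this kind can be sketched as below. This is not PromptLoader itself: render_prompt is a hypothetical function, and raising on an unfilled placeholder is an assumption about how missing context should be handled.

```python
import re

# Matches {{name}} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, context: dict) -> str:
    """Replace every {{placeholder}} in `template` with its value from `context`."""
    names = {match.group(1) for match in PLACEHOLDER.finditer(template)}
    missing = sorted(names - set(context))
    if missing:
        # ASSUMPTION: fail loudly so a half-rendered prompt never reaches an agent.
        raise KeyError(f"unfilled placeholders: {missing}")
    return PLACEHOLDER.sub(lambda match: str(context[match.group(1)]), template)
```

For example, render_prompt("Target: {{target_system}}", {"target_system": "My System"}) returns "Target: My System".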

EXECUTE_ANALYZE: Merged Execution Pipeline

The executor agent (Sonnet, via the Claude Agent SDK) handles the entire execution pipeline in a single session:

  1. Receives the approved hypothesis bundle
  2. Explores the target repo, discovers build commands
  3. Produces experiment_plan.yaml with exact shell commands per arm
  4. Runs the commands, captures stdout/stderr per condition
  5. Compares observed metrics against predictions
  6. Produces findings.json and principle_updates.json

After execution, the orchestrator validates artifacts (schema check) and merges principles by ID into principles.json.
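A merge keyed by ID can be sketched as follows. The function name, the "id" field name, and the rule that fields in an update overwrite matching fields of the existing entry are all assumptions for illustration; the real schema is principles.schema.json.

```python
from typing import Dict, List


def merge_principles(existing: List[dict], updates: List[dict]) -> List[dict]:
    """Upsert `updates` into `existing`, keyed by the (assumed) "id" field."""
    by_id: Dict[str, dict] = {p["id"]: p for p in existing}
    for update in updates:
        # Known ID: overlay the new fields. Unknown ID: insert a new entry.
        by_id[update["id"]] = {**by_id.get(update["id"], {}), **update}
    # Dicts preserve insertion order, so existing principles keep their position.
    return list(by_id.values())
```

In this sketch an update that changes a principle's status stays in the store; dropping pruned entries would be a separate step.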

Model Configuration

Two Claude Agent SDK calls per iteration:

| Phase | Model | Role |
|-------|-------|------|
| DESIGN | Opus | Planner: explores, frames, designs hypothesis bundle |
| EXECUTE_ANALYZE | Sonnet | Executor: builds, patches, runs, analyzes, extracts |

Simplified Campaign

With SDKDispatcher, a campaign configuration can be as simple as:

research_question: "What drives latency in my system?"
target_system:
  name: "My System"
  description: "A service that processes requests."
  repo_path: /path/to/repo

The planner explores the codebase to discover observable metrics, controllable knobs, and execution methods. The full campaign format (with explicit metrics and knobs) remains supported — provided values take precedence over what the planner discovers.

Code Change Intents

When using SDKDispatcher, the planner can include optional code_changes in bundle arms:

arms:
  - type: h-main
    prediction: "TTFT decreases by 15-25%"
    mechanism: "SJF reorders by predicted compute cost"
    diagnostic: "Check scheduling order"
    code_changes:
      - file: scheduler/policy.go
        intent: "Replace FCFS with shortest-job-first"
        rationale: "Prefix-heavy requests have predictable cost"

The planner says what and why — the executor implements the actual changes in a git worktree.

Ledger (orchestrator/ledger.py)

The ledger is a deterministic module that appends a schema-conformant row to ledger.json after each iteration. It reads findings.json, bundle.yaml, and principles.json to extract h_main_result, ablation_results, control_result, robustness_result, prediction accuracy, and principle changes. It makes no LLM calls.

Gates (orchestrator/gates.py)

Human gates are hard stops that cannot be bypassed. They surface the artifact and review summaries, then wait for a decision.

Valid decisions:

  • approve — advance to the next phase
  • reject — loop back (HUMAN_DESIGN_GATE → DESIGN, HUMAN_FINDINGS_GATE → EXECUTE_ANALYZE)
  • abort — end the campaign

Testing modes: auto_approve=True or auto_response="reject" for deterministic testing without human interaction.

Where gates appear:

  1. HUMAN_DESIGN_GATE — after DESIGN, human sees the hypothesis bundle
  2. HUMAN_FINDINGS_GATE — after EXECUTE_ANALYZE, human sees findings and principle updates
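The mapping from a gate decision to the next phase can be sketched like this. The function is hypothetical and does not reproduce orchestrator/gates.py: returning None for abort, and letting auto_response take precedence over auto_approve, are both assumptions.

```python
from typing import Callable, Optional

# Next phase per gate and decision, from the "Valid decisions" list above.
GATE_OUTCOMES = {
    "HUMAN_DESIGN_GATE": {"approve": "EXECUTE_ANALYZE", "reject": "DESIGN"},
    "HUMAN_FINDINGS_GATE": {"approve": "DONE", "reject": "EXECUTE_ANALYZE"},
}
VALID_DECISIONS = ("approve", "reject", "abort")


def resolve_gate(
    gate: str,
    auto_approve: bool = False,
    auto_response: Optional[str] = None,
    ask: Callable[[str], str] = input,
) -> Optional[str]:
    """Return the next phase, or None when the campaign is aborted (assumed)."""
    if auto_response is not None:  # ASSUMPTION: wins over auto_approve
        decision = auto_response
    elif auto_approve:
        decision = "approve"
    else:
        decision = ask(f"{gate} decision (approve/reject/abort): ").strip().lower()
    if decision not in VALID_DECISIONS:
        raise ValueError(f"invalid gate decision: {decision!r}")
    if decision == "abort":
        return None
    return GATE_OUTCOMES[gate][decision]
```

Injecting ask makes the interactive path testable: resolve_gate("HUMAN_DESIGN_GATE", ask=lambda _: "reject") returns "DESIGN".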

Gate Summaries

Before each human gate, a formatted summary (gate_summary_*.json) is produced. The summary includes a plain-language description and bullet points highlighting what matters for the decision.

Gates display the summary first, then the raw artifact (for those who want full detail).

Spec-fidelity diff (#249 / F4). For the design-phase summary, _augment_summary_with_spec_diff (in orchestrator/iteration.py) post-processes the LLM-generated summary to attach a deterministic campaign_spec_diff block: locked_parameters violations, depth_overrides presence, declared workload changes. Always emitted, regardless of --auto-approve. nous status surfaces it in human-readable form.

Spec fidelity (orchestrator/validate.py)

Two pure-Python validators close the gap between self-consistency (the executor matches the bundle) and spec-fidelity (the bundle matches the campaign):

  • _validate_locked_parameters (#246 / F1) — every entry in campaign.locked_parameters must match bundle.experiment_spec.verified_parameters exactly. Hard-fail regardless of --auto-approve.
  • _validate_locked_workload (#265 / F20) — walks the canonical workload structure and diffs against bundle.inputs/*.yaml. Declared deviations (bundle.workload_changes_from_canonical) are allowed; undeclared are hard-fails.

compute_campaign_spec_diff exposes the same logic for read-only auditor use (the F4 gate-summary diff). See docs/campaign-authoring-guide.md for the discipline these enforce.
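An exact-match check between locked and verified parameters can be sketched as below. This is an illustration only, not _validate_locked_parameters: the function name is hypothetical and both inputs are assumed to be flat mappings of parameter name to value.

```python
from typing import Any, List, Mapping


def find_locked_parameter_violations(
    locked: Mapping[str, Any], verified: Mapping[str, Any]
) -> List[str]:
    """List every locked parameter that is missing or differs in the bundle."""
    violations = []
    for name, expected in locked.items():
        if name not in verified:
            violations.append(f"{name}: missing from verified_parameters")
        elif verified[name] != expected:
            violations.append(
                f"{name}: campaign locks {expected!r}, bundle has {verified[name]!r}"
            )
    return violations
```

An empty list means the bundle honors every lock; anything else would be treated as a hard failure. Extra bundle parameters that the campaign does not lock are ignored in this sketch.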

Reproducibility metadata (orchestrator/reproducibility.py)

capture_reproducibility_metadata (#262 / F17) runs at INIT and records target repo HEAD, dirty flag, hardware-config sha, language versions, gpu_memory_utilization, latency-config file paths. The block is persisted in state.json (first-capture wins; re-running INIT preserves the original) and surfaced via nous status. Per-iteration snapshot_iter_files copies the actual hardware/latency config files into runs/iter-N/snapshots/ so a future reviewer can diff exact numbers even after the operator edits the source-of-truth file.

Cross-campaign code reuse (orchestrator/lineage.py)

emit_cumulative_patch (#266 / F21) runs at iteration completion, before the experiment branch is destroyed, capturing git diff <main>..<branch> to runs/iter-N/patches/cumulative.patch. Future campaigns reuse it via:

derived_from:
  campaign: paper-memorytime-mirage
  iteration: 2          # or "final"

apply_derived_from_patch resolves and applies the cumulative patch to every experiment worktree as a preflight. nous lineage <run_id> surfaces the inheritance chain.

Per-phase silence threshold (orchestrator/sdk_dispatch.py)

_resolve_turn_silence_threshold(phase) (#264 / F19) walks the resolution chain — bundle per-phase override → bundle scalar override → campaign per-phase value → phase default (design=600, execute_analyze=120, report=240). DESIGN's heavy reasoning between tool calls earns a longer threshold than EXECUTE_ANALYZE's frequent simulator calls, eliminating the active stall observed in paper-memorytime-mirage iter-3.
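The resolution chain can be sketched as a sequence of fallbacks. The parameter names below are hypothetical; how the bundle and campaign files actually encode these overrides is not shown in this document.

```python
from typing import Mapping, Optional

# Phase defaults quoted in the text above.
PHASE_DEFAULTS = {"design": 600, "execute_analyze": 120, "report": 240}


def resolve_turn_silence_threshold(
    phase: str,
    bundle_per_phase: Optional[Mapping[str, int]] = None,
    bundle_scalar: Optional[int] = None,
    campaign_per_phase: Optional[Mapping[str, int]] = None,
) -> int:
    """Return the first value found along the resolution chain."""
    if bundle_per_phase and phase in bundle_per_phase:
        return bundle_per_phase[phase]  # 1. bundle per-phase override
    if bundle_scalar is not None:
        return bundle_scalar  # 2. bundle scalar override
    if campaign_per_phase and phase in campaign_per_phase:
        return campaign_per_phase[phase]  # 3. campaign per-phase value
    return PHASE_DEFAULTS[phase]  # 4. phase default
```

The most specific source wins, so a bundle can lengthen the threshold for one phase without affecting the others.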

Plot specs + paper packaging (orchestrator/plot_specs.py, nous package)

invoke_plot_specs (#263 / F18) reads campaign.plot_specs, invokes each user-supplied figure script with NOUS_RESULTS_DIR and NOUS_FIGURES_DIR environment variables. nous package tarballs work_dir + reproduce.sh + Dockerfile + README using the F17 reproducibility metadata.

Data Flow

Within One Iteration

                    Planner (Opus)
                       │
                       ▼
              problem.md + bundle.yaml
                       │
                       ▼
              HUMAN_DESIGN_GATE (approve/reject/abort)
                       │
                       ▼
                  Executor (Sonnet)
                       │
                       ▼
         experiment_plan.yaml + execution_results.json
         + findings.json + principle_updates.json
                       │
                       ▼
              HUMAN_FINDINGS_GATE (approve/reject/abort)
                       │
                       ▼
              principles.json (upsert by ID)
                       │
                       ▼
                     DONE

Across Iterations

Iteration 1                    Iteration 2                    Iteration N
┌──────────────────┐          ┌──────────────────┐          ┌──────────────┐
│ Design           │          │ Design           │          │              │
│ Execute          │   ───▶   │  (constrained by │   ───▶   │   ...        │
│ Extract          │          │   principles)    │          │              │
│  → 2 principles  │          │ Execute          │          │              │
│                  │          │ Extract          │          │              │
│                  │          │  → 1 new,        │          │              │
│                  │          │    1 updated     │          │              │
└──────────────────┘          └──────────────────┘          └──────────────┘

principles.json grows and refines over time:
  iter 1: [P1, P2]
  iter 2: [P1, P2', P3]       (P2 updated, P3 inserted)
  iter 3: [P1, P2', P4]       (P3 pruned, P4 inserted)

Principles are hard constraints: the Planner must not design bundles that contradict active principles without explicit justification.

Multi-Iteration Campaign Flow

run_campaign.py loops through iterations:

for i in 1..max_iterations:
  ┌───────────────────────────────────────────────────────────┐
  │  run_iteration(iteration=i)                               │
  │    DESIGN → HUMAN_DESIGN_GATE → EXECUTE_ANALYZE           │
  │    → HUMAN_FINDINGS_GATE → DONE                           │
  └─────────────────────┬─────────────────────────────────────┘
                        │
                  (if not final)
                        │
              append_ledger_row(i)
                        │
              engine.transition("DESIGN")
                  (increments iteration counter)
                        │
                    next iteration
                  (principles injected into design prompt)

The deterministic ledger (orchestrator/ledger.py) appends one row per iteration with prediction accuracy and principle changes, without any LLM calls.

Schema Contracts

Every artifact exchanged between components is validated against a JSON Schema (Draft 2020-12). This ensures agents produce well-formed output and makes the system testable without LLMs.

| Schema | Format | Governs |
|--------|--------|---------|
| campaign.schema.yaml | YAML | Campaign configuration (target system, prompt layers) |
| state.schema.json | JSON | Orchestrator checkpoint (phase, iteration, run_id, config_ref) |
| bundle.schema.yaml | YAML | Hypothesis bundles (arms with predictions, mechanisms, diagnostics) |
| experiment_plan.schema.yaml | YAML | Experiment plans (exact commands per arm/condition) |
| findings.schema.json | JSON | Prediction-vs-outcome tables with error classification |
| principles.schema.json | JSON | Principle store (statement, confidence, regime, evidence, category, status) |
| ledger.schema.json | JSON | Append-only iteration log with prediction accuracy and domain metrics |

The campaign, bundle, and experiment-plan schemas use YAML format because they contain free-text fields that are more readable in YAML. All other schemas use JSON.

Human Review

Automated AI reviews (DESIGN_REVIEW, FINDINGS_REVIEW) have been removed. Quality control is now handled by:

  1. HUMAN_DESIGN_GATE — the human reviews the hypothesis bundle directly after DESIGN
  2. HUMAN_FINDINGS_GATE — the human reviews findings and principle updates after EXECUTE_ANALYZE

This removes the multi-perspective automated review overhead while keeping humans in the loop at both decision points.

Prediction Error Taxonomy

When a prediction is wrong, the error type determines what the system learns:

| Error Type | Meaning | System Response |
|------------|---------|-----------------|
| Direction | Mechanism is fundamentally wrong | Prune or heavily revise the principle |
| Magnitude | Right mechanism, wrong strength | Update principle with calibrated bounds |
| Regime | Works under different conditions | Update principle with correct regime boundaries |

Direction errors are the most serious and most valuable — they reveal where the causal model is fundamentally flawed. In the BLIS case study, a direction error in iteration 1 (predicting <10% degradation, observing 62.4% degradation) redirected the entire scheduling investigation toward admission control.

Crash Safety and Recovery

The orchestrator is designed for crash-safe operation:

  • Atomic state writes: state.json is written to a temp file, fsynced, then renamed. A crash during write leaves the previous valid state intact.
  • Checkpoint/resume: The engine loads state from state.json on construction. Kill the process at any point and restart — it resumes from the last committed state.
  • Append-only ledger: ledger.json is logically append-only; rows are never modified or deleted. The implementation reads the file, appends the new row, and atomically rewrites it.
  • Idempotent principle merge: The principle merge step reads the existing principles.json, upserts principles by ID, and writes back. Because entries are keyed by ID, re-running the merge for the same iteration rewrites the same entries rather than corrupting the store.

Extending Nous

Using a Different Dispatcher

Nous ships with three dispatchers:

  • StubDispatcher — deterministic stubs for testing.
  • InlineDispatcher — emits prompts to stdout for an enclosing agent framework (no subprocess, no API key).
  • SDKDispatcher — real agent calls via the Claude Agent SDK (default and only user-facing code-access path post-#183).

To create a custom dispatcher, extend LLMDispatcher. Your dispatcher must produce artifacts that pass schema validation — the orchestrator trusts the schema contract, not the content.
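As a rough sketch of a custom dispatcher, the class below replays pre-written artifacts instead of calling an agent. Everything here is hypothetical: LLMDispatcherStandIn is a placeholder for the real LLMDispatcher base class, whose methods are not documented here, and only the dispatch() keyword arguments are taken from the Dispatch interface section. The payloads written are not schema-validated by this code.

```python
import json
from pathlib import Path


class LLMDispatcherStandIn:
    """Placeholder for the real base class; its actual API is not shown here."""

    def dispatch(self, role, phase, output_path, iteration):
        raise NotImplementedError


class CannedDispatcher(LLMDispatcherStandIn):
    """Hypothetical dispatcher that replays artifacts keyed by (role, phase)."""

    def __init__(self, canned: dict):
        self.canned = canned
        self.calls = []  # record of dispatches, handy in tests

    def dispatch(self, role, phase, output_path, iteration):
        payload = self.canned[(role, phase)]
        path = Path(output_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        # The caller is responsible for supplying schema-conformant payloads.
        path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        self.calls.append((role, phase, iteration))
        return path
```

A dispatcher like this would suit replaying a recorded campaign, provided the canned payloads pass nous validate.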

Adding a New Arm Type

  1. Add the type to the enum in schemas/bundle.schema.yaml (arm type) and schemas/findings.schema.json (arm_type)
  2. Add test cases to tests/test_schemas.py