Nous — Usage Story, Architecture, and Phase Plan #20

mtoslalibu (Maintainer) · 2026-04-17T13:22:53Z

What Is Nous?

Nous turns Claude into a systematic scientist for your codebase. Instead of trial-and-error ("try this, did it work?"), Nous makes Claude predict before acting, test properly, learn from failures, and remember what it learned.

One sentence: The scientific method, implemented as a Claude Code plugin, so your system investigations compound instead of starting fresh every time.

The Gap We're Closing

Engineers have intuitions about their systems but no structured way to turn those intuitions into testable hypotheses, run controlled experiments, analyze why results happened (not just what happened), and carry those lessons forward to the next idea. Today, knowledge lives in people's heads and evaporates between sessions. Nous makes it accumulate.

The Problem

Engineers use Claude Code every day to explore and optimize their systems. It works — Claude is smart. But it has three blind spots:

No predictions. Claude just tries things. If it works, you don't know why. If it fails, you don't know what kind of wrong.
No controls. Did the improvement come from your change, or was the workload just favorable? No ablations, no negative controls, no multi-seed runs.
No memory. Next session, everything is gone. The engineer's head is the only knowledge store.

Nous fixes all three.

How You Use It

Install

/plugin install nous

Three Commands

| Command | What it does |
| --- | --- |
| `/nous:init` | Explore your repo, propose a configuration, you approve |
| `/nous:investigate "question"` | Run the scientific loop |
| `/nous:status` | See what's been learned, what's next |

Walkthrough: Investigating a Scheduler

You're working on an LLM serving system. You think priority scheduling might reduce latency. Here's what happens.

1. Set up — `/nous:init`

You open Claude Code in your repo and run /nous:init. Nous explores the codebase and presents:

I've explored the repo. Here's what I propose:

Target system: vLLM inference serving engine
Metrics:       TTFT P50/P99, TPOT, throughput, GPU utilization, KV-cache hit rate
Knobs:         scheduler_policy, max_batch_size, preemption_mode, cache_eviction
Execution:     benchmarks/serving_benchmark.py (supports seeded runs)
Reviewers:     statistical rigor, causal sufficiency, confound risk,
               GPU contention, memory pressure

[Approve] [Adjust] [Add more] [Remove]

You add "swap manager" as a knob, remove "memory pressure" (not relevant here), and approve. Nous writes a config file and creates a .nous/ directory next to your code. Nothing in your repo changes.
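As a rough illustration of what the approved configuration might hold, the sketch below models it as a Python dataclass. The class name `CampaignConfig`, its field names, and the `adjust` helper are hypothetical; the actual `campaign.yaml` schema is not shown in this post. The values are the ones from the proposal above.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class CampaignConfig:
    """Hypothetical in-memory view of a campaign configuration."""
    target_system: str
    metrics: list[str]
    knobs: list[str]
    execution: str
    reviewers: list[str] = field(default_factory=list)

    def adjust(self, add_knobs=(), remove_reviewers=()):
        """Return a new config with knobs added and reviewers removed."""
        knobs = self.knobs + [k for k in add_knobs if k not in self.knobs]
        reviewers = [r for r in self.reviewers if r not in remove_reviewers]
        return CampaignConfig(self.target_system, list(self.metrics),
                              knobs, self.execution, reviewers)


proposed = CampaignConfig(
    target_system="vLLM inference serving engine",
    metrics=["TTFT P50/P99", "TPOT", "throughput",
             "GPU utilization", "KV-cache hit rate"],
    knobs=["scheduler_policy", "max_batch_size",
           "preemption_mode", "cache_eviction"],
    execution="benchmarks/serving_benchmark.py",
    reviewers=["statistical rigor", "causal sufficiency", "confound risk",
               "GPU contention", "memory pressure"],
)

# Mirror the walkthrough: add one knob, drop one reviewer, then approve.
approved = proposed.adjust(add_knobs=["swap manager"],
                           remove_reviewers=["memory pressure"])
print(asdict(approved)["knobs"])
```

Returning a new object from `adjust` keeps the original proposal intact, which makes it easy to show the engineer a before/after diff at approval time.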

2. Investigate — `/nous:investigate`

> /nous:investigate "Can priority-aware scheduling reduce TTFT P99 under mixed workloads?"

Nous designs a hypothesis bundle:

| Arm | Prediction | Why |
| --- | --- | --- |
| H-main | P99 drops 35% | Short requests skip head-of-line blocking |
| H-ablation | Without queue reordering, improvement < 10% | Reordering is the key component |
| H-control-negative | At 20% utilization, no effect | No contention → nothing to fix |

You see the bundle alongside the AI review summaries. If something looks off, you reject it and Nous revises; otherwise, you approve.
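A hypothesis bundle can be pictured as a small list of arms, each pairing a prediction with its rationale. The sketch below is illustrative only: the `Arm` class, the `REQUIRED_ARMS` tuple, and the `missing_arms` check are assumptions, not the project's `bundle.yaml` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    name: str        # e.g. "H-main"
    prediction: str  # what we expect to observe
    rationale: str   # the proposed mechanism behind the prediction


# Assumed rule: every bundle carries a main arm, an ablation, and a negative control.
REQUIRED_ARMS = ("H-main", "H-ablation", "H-control-negative")


def missing_arms(arms):
    """Return the required arm names that the bundle does not contain."""
    present = {arm.name for arm in arms}
    return [name for name in REQUIRED_ARMS if name not in present]


bundle = [
    Arm("H-main", "P99 drops 35%",
        "Short requests skip head-of-line blocking"),
    Arm("H-ablation", "Without queue reordering, improvement < 10%",
        "Reordering is the key component"),
    Arm("H-control-negative", "At 20% utilization, no effect",
        "No contention → nothing to fix"),
]

print(missing_arms(bundle))      # []
print(missing_arms(bundle[:1]))  # ['H-ablation', 'H-control-negative']
```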

Nous runs the experiment in an isolated git worktree — your branch is untouched. Three seeds per arm. Results:

H-main:    REFUTED — P99 increased 62%!  (direction error)
H-ablation: REFUTED — reordering made things worse
H-control:  REFUTED — effect present at low load too (confounded)

Nous learns from the failure:

Error type: DIRECTION — the mechanism is fundamentally wrong.
Priority scheduling doesn't prevent head-of-line blocking.
It causes convoy effects.

→ Principle extracted:
  RP-1: "Priority scheduling causes convoy effects at >70% GPU utilization"
  Confidence: HIGH

This is the key insight: the prediction was wrong, but we learned exactly why. With plain Claude, you'd just say "didn't work, try something else."

3. Next iteration

Nous is now constrained by RP-1. It won't propose priority scheduling at high load. Instead, it explores admission control — a different mechanism entirely.

Iteration 2: SLO-gated admission control
  H-main: CONFIRMED — P99 dropped 41%
  Mechanism: rejecting requests that would violate SLOs prevents cascade

→ Principle extracted:
  RP-2: "SLO-gated admission control is non-zero-sum at saturation"
  Confidence: HIGH

4. Check progress — `/nous:status`

> /nous:status

Campaign: vllm-scheduling (iteration 4)
Active principles: 7
  RP-1: Priority scheduling causes convoy effects at >70% util [HIGH]
  RP-2: SLO-gated admission control is non-zero-sum at saturation [HIGH]
  RP-3: Preemption interacts with cache eviction [MEDIUM]
  ...
Prediction accuracy: 45% → 62% → 71% → 78%
Open questions: Does admission control hold under heterogeneous model sizes?

The prediction accuracy trend shows Nous building a better model of your system over time.
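The post does not define how prediction accuracy is computed. One plausible reading, assumed here, is the fraction of predictions in an iteration that were confirmed. The function names below are hypothetical; the trend values are the ones from the status output above.

```python
def prediction_accuracy(outcomes):
    """Fraction of predictions confirmed; outcomes is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one prediction")
    return sum(outcomes) / len(outcomes)


def is_improving(trend):
    """True if accuracy never decreases from one iteration to the next."""
    return all(later >= earlier for earlier, later in zip(trend, trend[1:]))


trend = [0.45, 0.62, 0.71, 0.78]  # values from the status output
print(is_improving(trend))        # True
print(prediction_accuracy([True, False, True, True]))  # 0.75
```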

Walkthrough: A Different System

Nous isn't tied to LLM serving. Here's the same flow on a transaction scheduling system:

> /nous:init

Target system: TxnDB transaction scheduler
Metrics:       txn/s, abort rate, latency P50/P99, lock contention
Knobs:         concurrency protocol, isolation level, partition strategy
Execution:     test/benchmark_txn.py
Reviewers:     + serializability, deadlock risk, skew sensitivity

> /nous:investigate "Does optimistic concurrency control reduce abort rate under skewed workloads?"

Same methodology, same orchestrator, same schemas. Only the configuration changes.

How It Compares

Nous vs Plain Claude

| | Plain Claude | With Nous |
| --- | --- | --- |
| Having an idea | "Try priority scheduling" | Prediction: "P99 drops 35% because of X" |
| Running it | One change, one run | Bundle: main + ablation + control, 3+ seeds |
| When it fails | "Let me try something else" | Classify error, extract principle |
| When it works | "Ship it" | Why did it work? Which part mattered? |
| Next session | Start from scratch | Constrained by principles |
| Months later | "What did we learn?" | 30 principles with evidence chains |

Nous vs Evolutionary Search (OpenEvolve/ADRS)

| | Evolutionary search | Nous |
| --- | --- | --- |
| Learns | Better code (the what) | Principles about why |
| On failure | Discard and mutate | Classify error, extract principle |
| Reusability | Start over per problem | Principles carry across problems |

These are complementary. Evolutionary search finds solutions in large parameter spaces. Nous builds understanding.

What's Underneath: The Architecture

The three commands above are the UX layer, the top of a four-layer stack:

┌────────────────────────────────────────────────────────────────────┐
│  UX Layer: Claude Code Plugin                                      │
│                                                                    │
│  /nous:init        /nous:investigate       /nous:status            │
└──────────┬───────────────────┬───────────────────┬─────────────────┘
           │                   │                   │
           ▼                   ▼                   ▼
┌────────────────────────────────────────────────────────────────────┐
│  Agent Layer: Four AI Roles                                        │
│                                                                    │
│  Planner          Executor        Reviewer       Extractor         │
│  (hypothesize)    (implement+run) (multi-judge)  (learn)           │
└──────────┬───────────────────┬───────────────────┬─────────────────┘
           │                   │                   │
           ▼                   ▼                   ▼
┌────────────────────────────────────────────────────────────────────┐
│  Orchestrator: Deterministic Python State Machine                  │
│                                                                    │
│  INIT → FRAMING → DESIGN → DESIGN_REVIEW → HUMAN_GATE →            │
│  RUNNING → FINDINGS_REVIEW → HUMAN_GATE → TUNING →                 │
│  EXTRACTION → next iteration or DONE                               │
│                                                                    │
│  NOT an LLM. Auditable, predictable, crash-safe.                   │
└──────────┬───────────────────┬───────────────────┬─────────────────┘
           │                   │                   │
           ▼                   ▼                   ▼
┌────────────────────────────────────────────────────────────────────┐
│  Data Layer: 8 Schema-Governed Artifacts                           │
│                                                                    │
│  campaign.yaml    "What system?"          Configuration            │
│  bundle.yaml      "What are we testing?"  Hypothesis bundle        │
│  findings.json    "What happened?"        Prediction vs outcome    │
│  principles.json  "What did we learn?"    Living knowledge base    │
│  state.json       "Where are we?"         Orchestrator checkpoint  │
│  ledger.json      "What's the history?"   Append-only log          │
│  trace.jsonl      "What happened inside?" Observability            │
│  summary.json     "How did it go?"        Campaign report card     │
└────────────────────────────────────────────────────────────────────┘

Why the layers are separate

AI reasons, Python enforces.

UX layer — Three commands. The engineer never thinks about state machines.
Agent layer — AI does the creative work. Swappable — better models drop in.
Orchestrator — Deterministic code. Gates cannot be bypassed. Fast-fail rules always fire. State survives crashes.
Data layer — Every artifact has a schema. Agents can only produce well-formed output. Everything is auditable.
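The split between AI reasoning and deterministic enforcement can be sketched as a small transition table. This is a minimal illustration, not the project's orchestrator: the state order comes from the architecture diagram, but splitting `HUMAN_GATE` into two distinctly named gates, the `advance` function, and its `approved` flag are assumptions. Rejection paths and the loop into the next iteration are omitted.

```python
# State order follows the architecture diagram. The two human gates get
# distinct names here (an assumption) so each state has exactly one successor.
TRANSITIONS = {
    "INIT": "FRAMING",
    "FRAMING": "DESIGN",
    "DESIGN": "DESIGN_REVIEW",
    "DESIGN_REVIEW": "HUMAN_GATE_DESIGN",
    "HUMAN_GATE_DESIGN": "RUNNING",
    "RUNNING": "FINDINGS_REVIEW",
    "FINDINGS_REVIEW": "HUMAN_GATE_FINDINGS",
    "HUMAN_GATE_FINDINGS": "TUNING",
    "TUNING": "EXTRACTION",
    "EXTRACTION": "DONE",
}
GATES = {"HUMAN_GATE_DESIGN", "HUMAN_GATE_FINDINGS"}


def advance(state, approved=False):
    """Return the next state; a gate only opens with explicit approval."""
    if state == "DONE":
        raise ValueError("campaign already finished")
    if state in GATES and not approved:
        return state  # stay put until a human approves
    return TRANSITIONS[state]


state = "DESIGN_REVIEW"
state = advance(state)                 # HUMAN_GATE_DESIGN
state = advance(state)                 # still HUMAN_GATE_DESIGN
state = advance(state, approved=True)  # RUNNING
print(state)
```

Because the table is plain data and `advance` has no model in the loop, the same inputs always give the same next state, which is what makes the gates auditable.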

Key Mechanisms

Human gates — Two hard stops per iteration. You approve the experiment plan before any code runs, and you approve the results before principles are extracted.

Fast-fail rules — Main hypothesis refuted? Skip remaining arms, go straight to learning. No wasted compute.
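Read literally, the fast-fail rule is a single deterministic branch. The sketch below assumes the main arm is listed first and that verdicts are the strings `"CONFIRMED"` and `"REFUTED"`; the function name is hypothetical.

```python
def arms_to_run(arm_names, main_result):
    """Decide which remaining arms to execute once the main arm has a verdict.

    arm_names: all arm names in the bundle, main arm first (assumed ordering).
    main_result: "CONFIRMED" or "REFUTED" for the main arm.
    """
    if main_result == "REFUTED":
        return []               # fast-fail: skip straight to learning
    return list(arm_names[1:])  # otherwise run the ablation and controls


arms = ["H-main", "H-ablation", "H-control-negative"]
print(arms_to_run(arms, "REFUTED"))    # []
print(arms_to_run(arms, "CONFIRMED"))  # ['H-ablation', 'H-control-negative']
```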

Worktree isolation — Every experiment runs in an isolated git worktree. Your branch is never touched.
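One way to get this isolation is the standard `git worktree add` command, which checks out a new branch into a separate directory. The sketch below only illustrates the idea: the `.nous/worktrees/<name>` location, the `nous/<name>` branch naming, and both function names are assumptions, not the project's actual layout.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo, name):
    """Build the `git worktree add` command for an experiment checkout.

    The path under .nous/worktrees and the nous/<name> branch are assumed
    conventions for this sketch.
    """
    path = Path(repo) / ".nous" / "worktrees" / name
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", f"nous/{name}", str(path)]


def create_worktree(repo, name):
    """Create the worktree; raises CalledProcessError if git fails."""
    subprocess.run(worktree_add_command(repo, name), check=True)


print(worktree_add_command(".", "iter-1"))
```

Separating command construction from execution keeps the first function testable without a git repository on hand.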

Prediction error taxonomy — When a prediction is wrong:

Direction — the whole idea is wrong (most valuable to learn from)
Magnitude — right idea, wrong amount
Regime — right idea, wrong conditions
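The taxonomy can be approximated with a few comparisons on signed changes. This is a sketch under stated assumptions: changes are signed fractions (for example, -0.35 for "drops 35%"), `tolerance` is an arbitrary slack for calling a prediction correct, and `holds_elsewhere` is a caller-supplied flag meaning the predicted effect did appear under other conditions. None of these names come from the project.

```python
def classify_error(predicted, observed, tolerance=0.10, holds_elsewhere=False):
    """Classify a prediction against the observed outcome.

    predicted, observed: signed fractional changes in the metric.
    tolerance: assumed absolute slack for counting a prediction as correct.
    holds_elsewhere: True if the predicted effect showed up under other conditions.
    """
    if predicted * observed < 0:
        return "DIRECTION"  # opposite sign: the mechanism itself is suspect
    if abs(predicted - observed) <= tolerance:
        return "NONE"       # prediction held
    if holds_elsewhere:
        return "REGIME"     # right idea, wrong conditions
    return "MAGNITUDE"      # right idea, wrong amount


# The walkthrough's main arm: predicted a 35% drop, observed a 62% increase.
print(classify_error(-0.35, 0.62))  # DIRECTION
```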

Compounding knowledge — Principles from iteration N constrain iteration N+1. The system gets smarter. Prediction accuracy trends upward.
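To make "constrain" concrete, here is a minimal sketch in which each principle rules out one mechanism above a utilization threshold, and candidate ideas are filtered against the active set. The `Principle` fields and the `allowed` function are hypothetical; real principles presumably carry richer conditions and evidence chains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    pid: str
    statement: str
    blocks_mechanism: str     # mechanism this principle rules out (assumed field)
    above_utilization: float  # applies strictly above this load (assumed field)


def allowed(mechanism, utilization, principles):
    """True if no active principle rules out this mechanism at this load."""
    return not any(
        p.blocks_mechanism == mechanism and utilization > p.above_utilization
        for p in principles
    )


rp1 = Principle(
    "RP-1",
    "Priority scheduling causes convoy effects at >70% GPU utilization",
    blocks_mechanism="priority_scheduling",
    above_utilization=0.70,
)

print(allowed("priority_scheduling", 0.90, [rp1]))  # False
print(allowed("admission_control", 0.90, [rp1]))    # True
```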

How It Generalizes

Nous works on any system where you can measure something, change something, run it again, and reason about its parts. The generalization happens in three places:

Campaign config (campaign.yaml) — describes the target system. Generated by /nous:init for any repo.
Two-layer prompts — generic methodology (shared) + domain adapter (generated per system).
Meta-principles — lessons about the investigation process itself that transfer across all campaigns.
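The two-layer prompt idea can be sketched as simple composition: one shared methodology block plus a per-system adapter. The prompt text below is placeholder wording, and `build_prompt` is a hypothetical helper, not the project's dispatch code.

```python
# Placeholder text standing in for the shared methodology layer.
GENERIC_METHODOLOGY = (
    "Before running anything, state a falsifiable prediction.\n"
    "Include an ablation arm and a negative control.\n"
)


def build_prompt(domain_adapter, question):
    """Join the shared layer, the per-system layer, and the question."""
    parts = [GENERIC_METHODOLOGY.strip(), domain_adapter.strip(),
             f"Question: {question}"]
    return "\n\n".join(parts)


vllm_adapter = "Target system: vLLM inference serving engine.\nMetrics: TTFT P50/P99, TPOT."
txn_adapter = "Target system: TxnDB transaction scheduler.\nMetrics: txn/s, abort rate."

prompt_a = build_prompt(vllm_adapter, "Can priority-aware scheduling reduce TTFT P99?")
prompt_b = build_prompt(txn_adapter, "Does optimistic concurrency control reduce abort rate?")
print(prompt_b)
```

Only the adapter and the question differ between the two prompts; the methodology layer is shared verbatim.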

Why a Top Engineer Would Use This

A top engineer already thinks scientifically about their system. They have intuitions, mental models, informal hypotheses. What they lack:

Discipline — They know they should run controls and ablations. It's just tedious. Nous automates it.
Memory — Insights live in their head, Slack threads, half-finished docs. Nous keeps a living knowledge base with evidence chains.
Rigor — One seed, one workload, one run proves nothing. Nous enforces multi-seed, multi-condition experiments.
Error taxonomy — When something fails, knowing what kind of wrong (direction vs magnitude vs regime) determines what to do next.

Nous doesn't replace the engineer's thinking. It structures it — and makes Claude operate at the same level of rigor the engineer aspires to but rarely achieves under time pressure.

Phase Plan

Phase 1: Schemas + Orchestrator Skeleton — DONE

Issue: #11 | PR: #14

Built the foundation:

8 JSON schemas defining every artifact
7 templates for starting new campaigns
Deterministic orchestrator: 11-state machine, human gates, fast-fail rules, atomic checkpointing
Stub agent dispatch (produces valid artifacts without calling LLMs)
141 tests
Protocol, architecture, and data model documentation

What you can do after this phase: Run the full orchestrator loop with stub agents. Validate that the state machine, gates, and fast-fail rules work correctly.
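Atomic checkpointing is commonly done by writing to a temporary file and renaming it over the target, so a crash leaves either the old checkpoint or the new one, never a partial file. The sketch below shows that general pattern with the standard library; it is not taken from the Nous codebase, and the function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())    # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)     # do not leave stray temp files behind
        raise


def load_checkpoint(path):
    """Read the last fully written checkpoint."""
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "state.json")
    save_checkpoint(checkpoint, {"state": "RUNNING", "iteration": 1})
    print(load_checkpoint(checkpoint))
```

Creating the temporary file in the same directory as the target matters: `os.replace` is only atomic when source and destination are on the same filesystem.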

Phase 2: Agent Prompts + LLM Dispatch

Issue: #8 | Status: Not started | Depends on: Phase 1

Replace stubs with real LLM agents.

What you can do after this phase: Run a real single-iteration experiment on BLIS via the Python API.

Phase 3: Plugin UX — Init, Investigate, Status

Issue: #19 | Status: Not started | Depends on: Phase 2

Make it easy to use. Three Claude Code plugin skills.

What you can do after this phase: Install Nous as a plugin, run /nous:init on any repo, and start investigating.

Phase 4: Multi-Iteration Campaigns + Observability

Issue: #9 | Status: Not started | Depends on: Phases 1–3

Scale from single iterations to sustained campaigns.

What you can do after this phase: Run 3–5 iteration campaigns on BLIS. Compare discovered principles to the known BLIS principle catalog.

Phase 5: Outer Loop + Co-Evolution

Issue: #10 | Status: Not started | Depends on: Phases 1–4

Real-world validation, experiment selection governed by value of information (VoI), and self-improvement.

Summary: Path to "Easy to Use"

Phase 1 (done)     Skeleton works with stubs
        ↓
Phase 2            Skeleton works with real LLMs
        ↓          Milestone: run one iteration on BLIS (Python API)
Phase 3            Easy to use
        ↓          Milestone: /nous:init + /nous:investigate on BLIS
Phase 4            Sustained campaigns
        ↓          Milestone: 5-iteration campaign, principle discovery
Phase 5            Self-improving
                   Milestone: real-world validation, co-evolution
