The Problem
Engineers constantly investigate software systems. Why does latency spike under this workload? What routing algorithm works best here? Where does the system break under load? What configuration maximizes throughput?
Current tools only cover parts of this. Conversational AI is broad but unstructured — insights don't compound across sessions. Automated search frameworks are structured but narrow — they find what scores highest, not why. Neither enables systematic, compounding, full-spectrum investigation.
What is ASED?
Agentic Systems Experimentation and Discovery (ASED) is the class of activities where an AI agent systematically investigates a software system — experimenting with it, understanding its behavior, and discovering things about it.
| Use Case | What you're trying to do |
|---|---|
| Algorithm discovery | Find a better algorithm or heuristic |
| Performance optimization | Find configurations that optimize a target metric |
| Fault discovery | Uncover failure modes, bugs, and dangerous conditions |
| Regime identification | Find the boundaries where strategies break or shift |
| Mechanism understanding | Learn why things work, not just what works |
| Strategy evolution | Evolve system strategies through structured iteration |
ASED means structured, scientific, compounding investigation — not ad-hoc chat.
Current Approaches and Their Limits
Manual investigation — broad, produces real understanding, but slow. Insights live in someone's head.
Conversational AI (Claude, ChatGPT, Copilot) — broad in scope, but unstructured. Sessions are ephemeral — no compounding, no hypothesis testing, no audit trail. The human is the process.
Evolutionary AI search (FunSearch, AlphaEvolve, SkyDiscover) — structured and reproducible, but narrow. Enables algorithm discovery only. The LLM is a code mutator, not a scientist. You learn what scores highest — not why.
Grid search, Bayesian optimization (BO), fuzzing: structured but narrow. Each covers only one slice of the problem: config search, parameter tuning, or fault finding, respectively.
| | Narrow scope | Broad scope |
|---|---|---|
| Unstructured | Fuzzing, random testing | Manual investigation, Conversational AI |
| Structured & scientific | Grid search, BO, SkyDiscover | Nous |
Nous
Nous is a framework that runs the scientific method on software systems using AI agents.
Two key properties make it work. First, hypothesis-driven experimentation: the agent forms a falsifiable claim, designs a controlled experiment to test it, and learns from the outcome either way. Refuted hypotheses are as valuable as confirmed ones — they reveal where the mental model was wrong. Second, compounding knowledge: principles extracted from iteration N constrain the design space of iteration N+1. The system gets smarter over time — not just better-configured, but better at predicting how the system will behave.
Nous is domain-agnostic. It applies to any system with a simulator or testbed: LLM inference serving, database transaction scheduling, distributed load balancing, cloud resource allocation, and more.
Scope and Applicability
When Nous applies
Nous requires four preconditions. A system must have all of them.
| Precondition | What it means | Why Nous needs it |
|---|---|---|
| Observable metrics | The system produces measurable outputs (latency, throughput, error rate, resource utilization, etc.) | Hypotheses require quantitative predictions; without metrics, claims cannot be falsified |
| Controllable policy space | There are knobs to turn: algorithms, configurations, scheduling policies, routing rules, resource limits | If nothing varies, there is nothing to investigate |
| Reproducible execution | A simulator, testbed, or staging environment where experiments can be repeated under controlled conditions with multiple seeds | Controlled repetition is required to separate signal from noise and isolate mechanisms |
| Decomposable mechanisms | System behavior arises from interacting components that can be reasoned about individually (e.g., routing + scheduling + caching) rather than a single opaque monolith | Hypothesis bundles ablate components, which requires that components can be isolated |
One-line summary: Nous applies to systems where you can vary policies, measure outcomes, repeat experiments, and reason about mechanisms.
In-scope problem classes
The engineer is asking one of these questions:
| Problem class | Question form | Example |
|---|---|---|
| Algorithm discovery | "What mechanism should the system use?" | Best routing signal combination for LLM inference |
| Performance optimization | "What settings maximize metric X under constraint Y?" | Config that minimizes P99 latency while maintaining throughput |
| Fault discovery | "Where does the system break or misbehave?" | KV-utilization scorer hurts under memory pressure |
| Regime identification | "When does strategy A stop working?" | Admission control has no effect below 50% utilization |
| Mechanism understanding | "Why does X happen?" | Why scheduling priority is zero-sum at saturation |
| Strategy evolution | "How should the system's strategy change over time?" | Evolving from single-signal to compound routing as load increases |
These are examples, not an exhaustive list. Any systematic investigation of runtime behavior, performance, correctness, or policy effectiveness is in scope.
Out of scope
| Out of scope | Why |
|---|---|
| Code synthesis: generating programs from scratch | Nous investigates existing systems. For automated code generation and search, use SkyDiscover or AlphaEvolve. |
| Formal verification: proving the absence of bugs | Nous is empirical (experiment-based), not formal (proof-based). Empirical methods can find bugs but cannot guarantee absence of bugs. |
| One-shot optimization: "just give me the best config" | Nous builds understanding over iterations. For pure parameter search without understanding, Bayesian optimization is faster and sufficient. |
| Systems without a policy space | If there are no knobs to vary, there is nothing to investigate. |
| Non-reproducible environments | Without controlled repetition, hypotheses cannot be tested. (Partial exception: production A/B testing with sufficient traffic can substitute, at higher cost and lower control.) |
| Opaque monolithic systems | If mechanisms cannot be isolated, ablation is impossible and Nous degrades to black-box optimization, providing no understanding advantage over BO. |
One-line framing: Nous is for investigating how a system behaves and why — not for building the system in the first place.
Is my system amenable to ASED?
Does your system have measurable metrics?
├─ No → Out of scope
└─ Yes → Can you vary policies/algorithms/configs?
    ├─ No → Out of scope
    └─ Yes → Can you run controlled experiments (simulator, testbed, or A/B)?
        ├─ No → Out of scope
        └─ Yes → Can you reason about individual mechanisms?
            ├─ No → Nous degrades to black-box optimization
            └─ Yes → ✓ Nous applies
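The decision tree can be written as a short function, which may be handy as a pre-flight check before starting a campaign. This is a minimal sketch: `SystemProfile`, its field names, and `assess` are illustrative names introduced here, not part of Nous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    # Hypothetical field names, one per question in the decision tree.
    has_metrics: bool
    has_policy_space: bool
    has_controlled_experiments: bool
    has_decomposable_mechanisms: bool


def assess(profile: SystemProfile) -> str:
    """Walk the tree top to bottom and return its verdict."""
    if not profile.has_metrics:
        return "out of scope"
    if not profile.has_policy_space:
        return "out of scope"
    if not profile.has_controlled_experiments:
        return "out of scope"
    if not profile.has_decomposable_mechanisms:
        return "degrades to black-box optimization"
    return "Nous applies"


print(assess(SystemProfile(True, True, True, False)))
```

The first three checks are hard stops, while the last one only weakens what Nous can offer, which mirrors the difference between "Out of scope" and "degrades" in the tree.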
What Nous Discovers
A single Nous campaign produces discoveries across multiple ASED use cases simultaneously. From the BLIS case study (LLM inference serving):
| Discovery type | Example |
|---|---|
| Algorithm discovery | Optimal routing signal pair (prefix-affinity + queue-depth); admission control as a "third lever" |
| Fault discovery | KV-utilization scorer is counterproductive under memory pressure |
| Regime identification | Admission control effect vanishes at low load; PA:QD ratio safety boundary |
| Mechanism understanding | Why scheduling alone is zero-sum at saturation |
| Strategy optimization | Compound strategy that generalizes across 4 workload shapes |
The Nous Process
6.1 The 5-Phase Loop
┌─────────────────────────────────────────────────────┐
│                                                     │
│  1. Frame        What are we investigating?         │
│     the          What's the baseline?               │
│     Problem      What does success look like?       │
│        │                                            │
│        ▼                                            │
│  2. Design       Form a falsifiable hypothesis.     │
│     Hypothesis   Decompose it into testable arms.   │
│     Bundle       Get it reviewed. Get approved.     │
│        │                                            │
│        ▼                                            │
│  3. Run          Implement. Run experiments.        │
│     Experiments  Analyze results. Document          │
│                  findings. Get findings reviewed.   │
│        │                                            │
│        ▼                                            │
│  4. Tune         If the mechanism works,            │
│     Parameters   optimize its parameters.           │
│        │                                            │
│        ▼                                            │
│  5. Extract      Insert new principles.             │
│     Principles   Update or prune existing ones.     │
│     & Iterate    Feed the store into next iter.     │
│        │                                            │
│        └──────────────────────────────► repeat      │
│                                                     │
└─────────────────────────────────────────────────────┘
| Phase | Input | Output | Gate |
|---|---|---|---|
| Frame | Research question | problem.md, baseline | none |
| Design | Principles, investigation summary | Hypothesis bundle | AI Design Review + Human Approval |
| Run | Approved bundle | Experiment results, findings.md | AI Findings Review + Human Approval |
| Tune | Confirmed mechanism | Optimized parameters | none (skip if H-main refuted) |
| Extract | Findings + principle store | Updated principles, investigation summary | none |
Hypothesis bundles decompose each candidate strategy into testable arms before any implementation:
| Arm | Tests | Example |
|---|---|---|
| H-main | Does the mechanism work? | "Compound routing reduces critical TTFT P99 by >40%" |
| H-ablation | Which components matter? | "Removing admission control degrades by >15%" |
| H-super-additivity | Non-linear interaction? | "Compound effect > sum of individual effects" |
| H-control-negative | Where should effect vanish? | "At low load (<50% util), compound ≈ round-robin" |
| H-robustness | Generalization? | "Effect holds under bursty (Gamma) arrivals" |
Each arm is a triple: (prediction, mechanism, diagnostic) — a quantitative claim, a causal explanation, and what to investigate if the prediction is wrong.
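One way to hold the (prediction, mechanism, diagnostic) triple in code is a small record type. The sketch below is illustrative only: `HypothesisArm` and `is_complete` are names invented here, and the mechanism and diagnostic strings are placeholders, not text from a real campaign. Only the prediction is taken from the H-main example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypothesisArm:
    name: str         # e.g. "H-main"
    prediction: str   # quantitative, falsifiable claim
    mechanism: str    # causal explanation for the prediction
    diagnostic: str   # what to investigate if the prediction is wrong

    def is_complete(self) -> bool:
        """An arm is usable only if all three parts of the triple are filled in."""
        parts = (self.prediction, self.mechanism, self.diagnostic)
        return all(part.strip() for part in parts)


h_main = HypothesisArm(
    name="H-main",
    prediction="Compound routing reduces critical TTFT P99 by >40%",
    mechanism="<placeholder: causal explanation>",
    diagnostic="<placeholder: what to check if refuted>",
)
print(h_main.is_complete())
```

Rejecting arms with an empty mechanism or diagnostic at construction time would be a stricter variant of the same idea.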
Bundle sizing — not every iteration needs all five arm types:
| Iteration type | Required arms | Optional |
|---|---|---|
| New compound mechanism (≥2 components) | H-main, all H-ablation, H-super-additivity, H-control-negative | H-robustness |
| Component removal/simplification | H-main, H-control-negative, removal ablation | H-robustness |
| Single-component mechanism | H-main, H-control-negative | H-robustness |
| Parameter-only change | H-main only | none |
| Robustness sweep (post-confirmation) | H-robustness arms only | none |
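The sizing table lends itself to a lookup that a Design gate could use to reject undersized bundles. The following is a sketch under assumptions: the iteration-type keys and `missing_arms` are hypothetical names, "removal ablation" is treated as an H-ablation arm, and the check only looks at arm types, so it does not verify that there is one ablation per component.

```python
# Required arm types per iteration type, transcribed from the sizing table.
# The dictionary keys are illustrative identifiers, not Nous constants.
REQUIRED_ARMS = {
    "new_compound_mechanism": {
        "H-main", "H-ablation", "H-super-additivity", "H-control-negative",
    },
    "component_removal": {"H-main", "H-control-negative", "H-ablation"},
    "single_component": {"H-main", "H-control-negative"},
    "parameter_only": {"H-main"},
    "robustness_sweep": {"H-robustness"},
}


def missing_arms(iteration_type, arm_types):
    """Return the required arm types absent from a bundle, sorted by name."""
    return sorted(REQUIRED_ARMS[iteration_type] - set(arm_types))


print(missing_arms("new_compound_mechanism", ["H-main", "H-ablation"]))
```

An empty result means the bundle meets the minimum; optional H-robustness arms are simply ignored by the check.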
Fast-fail rules — the orchestrator enforces these to avoid wasted work:
H-main refuted → skip remaining ablation/robustness arms, go directly to Extraction
Single dominant component (>80% of effect) → simplify strategy, drop minor components
H-control-negative fails → mechanism is confounded, redesign before continuing
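The three fast-fail rules can be expressed as one decision function. This is a sketch, not the orchestrator's actual logic: `next_action`, its return labels, and the input shape are assumptions, and so is the precedence used when more than one rule fires (refutation first, then confounding, then dominance).

```python
def next_action(h_main_confirmed, control_negative_passed, component_shares):
    """Apply the fast-fail rules to the arm results of one iteration.

    component_shares maps component name -> fraction of the total effect
    attributed to it by the ablation arms (hypothetical input format).
    """
    if not h_main_confirmed:
        # H-main refuted: skip remaining arms, go directly to Extraction.
        return "skip_to_extraction"
    if not control_negative_passed:
        # Effect did not vanish where it should: mechanism is confounded.
        return "redesign"
    if component_shares and max(component_shares.values()) > 0.80:
        # One component carries >80% of the effect: drop the minor ones.
        return "simplify"
    return "continue"


print(next_action(True, True, {"routing": 0.85, "admission": 0.15}))
```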
Prediction error classification — when a prediction is wrong, the type of error guides what the Extractor does:
| Error type | Meaning | Extractor action |
|---|---|---|
| Direction wrong | Fundamental misunderstanding of mechanism | Prune or heavily revise the principle |
| Magnitude wrong | Correct mechanism, inaccurate model of strength | Update principle with calibrated bounds |
| Regime wrong | Mechanism works under different conditions than predicted | Update principle with correct regime boundaries |
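A rough sketch of how the three error types might be told apart from numbers. Everything here is an assumption made for illustration: effects are signed relative changes (for example -0.40 for "reduces by 40%"), a 25% relative tolerance decides whether a prediction counts as correct, and a flag records whether the predicted effect showed up under conditions other than the predicted ones.

```python
def classify_error(predicted, observed, seen_in_other_regime=False, rel_tol=0.25):
    """Return None if the prediction held, else 'regime', 'direction', or 'magnitude'."""
    same_sign = predicted * observed > 0
    within_tol = abs(observed - predicted) <= rel_tol * abs(predicted)
    if same_sign and within_tol:
        return None  # prediction confirmed, nothing to classify
    if seen_in_other_regime:
        return "regime"  # mechanism works, but under different conditions
    if predicted * observed < 0:
        return "direction"  # effect went the opposite way
    return "magnitude"  # right direction (or no effect), wrong strength


# Mapping from error type to the Extractor action in the table above.
EXTRACTOR_ACTION = {
    "direction": "prune or heavily revise the principle",
    "magnitude": "update principle with calibrated bounds",
    "regime": "update principle with correct regime boundaries",
}

print(classify_error(-0.40, -0.10))
```

In practice the observed effect would come with a confidence interval across seeds, and the tolerance check would compare intervals rather than point values.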
Two quality gates sit inside the loop. At each gate: AI reviewers run independently from multiple perspectives, must converge (no CRITICAL findings), then the human approves. The human sees the AI review output before deciding. Design gate catches flaws before implementation; Findings gate catches analytical errors and bad causal claims before principles enter the store.
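The gate logic described above (reviewers must converge with no CRITICAL findings, then the human decides) could look roughly like this. `gate_decision`, the return labels, and every severity label other than CRITICAL are hypothetical.

```python
from typing import Dict, List, Optional


def gate_decision(severities_by_reviewer: Dict[str, List[str]],
                  human_approved: Optional[bool]) -> str:
    """Decide the outcome of a Design or Findings gate.

    severities_by_reviewer maps a reviewer perspective to the severity
    labels of its findings. human_approved is None until the human decides.
    """
    blocking = [name for name, sevs in severities_by_reviewer.items()
                if "CRITICAL" in sevs]
    if blocking:
        return "revise"  # reviewers have not converged yet
    if human_approved is None:
        return "await_human"  # surface the AI reviews and pause
    return "approved" if human_approved else "rejected"


print(gate_decision({"stats": ["MINOR"], "causal": []}, None))
```

Keeping the human decision as the last step matches the ordering in the text: the human always sees converged AI review output before approving.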
6.2 Principle Store (Insert / Update / Prune)
The principle store is the campaign's memory — a living knowledge base, not a log. Three operations:
Insert — add a new principle from a confirmed or refuted hypothesis
Update — refine an existing principle's scope, parameters, or confidence
Prune — mark a principle as superseded or refuted by new evidence
A store that drifts from reality constrains future iterations based on wrong beliefs — worse than no store at all. The Extraction phase is responsible for all three operations. The agent is explicitly prompted: "Which existing principles need to be refined or invalidated by these results?"
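A minimal in-memory sketch of the three store operations. The class and field names are assumptions, and persistence to principles.json is left out. Note that prune marks a principle rather than deleting it, as the description of Prune suggests.

```python
from dataclasses import dataclass, field


@dataclass
class Principle:
    pid: str
    statement: str
    status: str = "active"  # "active", "superseded", or "refuted"
    evidence: list = field(default_factory=list)  # iteration ids backing it


class PrincipleStore:
    def __init__(self):
        self._items = {}

    def insert(self, principle):
        if principle.pid in self._items:
            raise ValueError(f"duplicate principle id: {principle.pid}")
        self._items[principle.pid] = principle

    def update(self, pid, statement, iteration):
        """Refine an existing principle and record the supporting iteration."""
        p = self._items[pid]
        p.statement = statement
        p.evidence.append(iteration)

    def prune(self, pid, status, iteration):
        """Mark a principle as superseded or refuted; the record is kept."""
        if status not in ("superseded", "refuted"):
            raise ValueError("status must be 'superseded' or 'refuted'")
        p = self._items[pid]
        p.status = status
        p.evidence.append(iteration)

    def active(self):
        return [p for p in self._items.values() if p.status == "active"]


store = PrincipleStore()
store.insert(Principle("P1", "Admission control has no effect below 50% utilization", evidence=[1]))
store.prune("P1", "superseded", iteration=4)
print(len(store.active()))
```

Only active principles would be fed into the next Design prompt; pruned ones stay available for audit.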
6.3 Investigation Summary (Bounded Working Memory)
After each Extraction, the agent produces a compact investigation summary — what has been investigated, what principles hold, what families are open vs. exhausted, what the most promising next directions are. This summary replaces the full ledger as the agent's working context in the next iteration.
The next iteration's Design prompt is seeded with: (research question, investigation summary, last experiment outcome) — not the full history. Context size stays O(summary) regardless of campaign depth. The full ledger is always on disk.
6.4 Mechanism Families and Convergence
A campaign investigates one mechanism family at a time — a cluster of related system knobs (e.g., routing signals, scheduling priority, admission control). Families are independent: each runs its own hypothesis loop, and discoveries in one family propagate to others via the principle store.
Stagnation signal: The orchestrator tracks the count of consecutive iterations with no new principles extracted. This count is surfaced to the human as an informational signal — the human decides when a family is exhausted and when to switch or stop. The system informs; the human decides.
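The stagnation count is simple to compute from per-iteration extraction results. In this sketch the input is assumed to be a list with the number of new principles extracted in each iteration, oldest first; `stagnation_count` is a name chosen here.

```python
def stagnation_count(new_principles_per_iteration):
    """Count consecutive most-recent iterations that yielded no new principles."""
    count = 0
    for n in reversed(new_principles_per_iteration):
        if n > 0:
            break
        count += 1
    return count


print(stagnation_count([2, 0, 1, 0, 0]))  # → 2
```

The function only reports the number; consistent with the text, deciding whether that number means a family is exhausted is left to the human.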
6.5 Structured Ledger
One JSON record per completed iteration: hypothesis bundle, experiment outcomes, principles extracted, prediction accuracy. The ledger feeds into the next iteration's Design prompt (via the investigation summary). It is never modified — only appended. Full audit trail.
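One straightforward way to get append-only behavior is to write one JSON object per line and open the file in append mode. This is an assumption about the storage format, not a description of how Nous writes its ledger; the record fields in the usage example are also illustrative.

```python
import json
import tempfile
from pathlib import Path


def append_record(path: Path, record: dict) -> None:
    """Append one iteration record; existing lines are never rewritten."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def read_records(path: Path) -> list:
    """Read the full audit trail back, oldest record first."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage with a throwaway directory and hypothetical record fields.
with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "ledger.jsonl"
    append_record(ledger, {"iteration": 1, "h_main": "confirmed", "principles": ["P1"]})
    append_record(ledger, {"iteration": 2, "h_main": "refuted", "principles": []})
    print(len(read_records(ledger)))
```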
Agentic Implementation
7.1 Architecture Overview
┌──────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR                       │
│           (shell script or Python, NOT an LLM)           │
│  Owns: state machine, file I/O, gate logic               │
│  Drives: phase transitions, Claude invocations           │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌───────────┐  ┌───────────┐  ┌──────────┐  ┌────────┐  │
│  │  PLANNER  │  │ EXECUTOR  │  │ REVIEWER │  │EXTRACT.│  │
│  │ (Claude)  │  │ (Claude)  │  │ (Claude) │  │(Claude)│  │
│  └───────────┘  └───────────┘  └──────────┘  └────────┘  │
│                                                          │
├──────────────────────────────────────────────────────────┤
│  Files on disk:                                          │
│    state.json      - current phase, family, iteration    │
│    ledger.json     - one record per completed iteration  │
│    principles.json - the principle store                 │
│    summary.md      - investigation summary (bounded)     │
│    runs/<iter>/    - per-iteration artifacts             │
└──────────────────────────────────────────────────────────┘
The orchestrator is a script, not an LLM. Claude does not decide when to transition states — the orchestrator does, based on file existence, exit codes, and review convergence checks. This keeps the state machine deterministic and auditable.
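A deterministic transition function driven by file existence and gate results might look like the sketch below. The `Phase` enum, the function signature, and the choice to keep problem.md at the campaign root are assumptions; exit-code handling is omitted for brevity.

```python
from enum import Enum
from pathlib import Path


class Phase(Enum):
    FRAME = "frame"
    DESIGN = "design"
    RUN = "run"
    TUNE = "tune"
    EXTRACT = "extract"


def next_phase(phase, root: Path, iter_dir: Path,
               design_ok=False, findings_ok=False, h_main_confirmed=False):
    """Return the next phase, or the same phase if its exit condition is unmet."""
    if phase is Phase.FRAME:
        return Phase.DESIGN if (root / "problem.md").exists() else phase
    if phase is Phase.DESIGN:
        ready = (iter_dir / "hypothesis.md").exists() and design_ok
        return Phase.RUN if ready else phase
    if phase is Phase.RUN:
        if not ((iter_dir / "findings.md").exists() and findings_ok):
            return phase
        # Tune is skipped when H-main was refuted.
        return Phase.TUNE if h_main_confirmed else Phase.EXTRACT
    if phase is Phase.TUNE:
        return Phase.EXTRACT
    return Phase.DESIGN  # EXTRACT loops back into the next iteration
```

Because the function depends only on the filesystem and explicit flags, the same inputs always give the same transition, which is what makes the state machine auditable.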
7.2 Agent Roles
| Role | Phase | Job | Tools |
|---|---|---|---|
| Planner | Frame, Design | Produce hypothesis bundle from principles + summary | Read files, write hypothesis.md |
| Executor | Run, Tune | Implement experiment, run simulator, analyze results | Read/write files, shell commands |
| Reviewer | Design Review, Findings Review | Review artifact from assigned perspective | Read files only, write review-<perspective>.md |
| Extractor | Extract | Update principle store (insert/update/prune), rewrite summary | Read/write principles.json, summary.md |
Each role is a separate Claude invocation. Reviewers run in parallel (multiple perspectives simultaneously).
Invocation sketches:
# Planner (Design phase)
claude -p "You are a research planner. Design a hypothesis bundle for the next iteration. Read: principles.json, summary.md. Write: runs/<iter>/hypothesis.md" --allowedTools Read,Write
# Executor (Run phase)
claude -p "You are an experiment executor. Implement and run the approved experiment. Read: runs/<iter>/hypothesis.md. Write: runs/<iter>/findings.md" --allowedTools Read,Write,Bash
# Reviewer (multiple in parallel, one per perspective)
claude -p "You are a design reviewer (perspective: statistical rigor). Review the hypothesis bundle. Read: runs/<iter>/hypothesis.md, principles.json. Write: runs/<iter>/reviews/review-stats.md" --allowedTools Read,Write
# Extractor (Extract phase)
claude -p "You are a principle extractor. Update the principle store from these findings. Read: runs/<iter>/findings.md, principles.json, summary.md. Write: principles.json (insert/update/prune), summary.md (rewrite)" --allowedTools Read,Write
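For a Python orchestrator, the parallel reviewer step could be driven with a thread pool around the same CLI call. This is a sketch under assumptions: the perspective slugs, function names, and the injectable `runner` (useful for dry runs without the CLI installed) are all introduced here, and the prompt text simply mirrors the reviewer invocation above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def reviewer_command(iter_dir, perspective):
    """Build the argv for one reviewer; mirrors the reviewer sketch above."""
    prompt = (
        f"You are a design reviewer (perspective: {perspective}). "
        "Review the hypothesis bundle. "
        f"Read: {iter_dir}/hypothesis.md, principles.json. "
        f"Write: {iter_dir}/reviews/review-{perspective}.md"
    )
    return ["claude", "-p", prompt, "--allowedTools", "Read,Write"]


def run_reviews(iter_dir, perspectives, runner=None):
    """Run one reviewer per perspective concurrently; return exit codes by perspective."""
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd, check=False).returncode
    with ThreadPoolExecutor(max_workers=max(1, len(perspectives))) as pool:
        codes = list(pool.map(
            lambda p: runner(reviewer_command(iter_dir, p)), perspectives))
    return dict(zip(perspectives, codes))


# Dry run with a stub runner, so nothing is actually invoked.
print(run_reviews("runs/1", ["stats", "causal"], runner=lambda cmd: 0))
```

Threads are sufficient here because each worker just waits on an external process; the orchestrator can then check the returned exit codes before running its convergence check.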
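7.3 State Machine and Transitions
The orchestrator advances when the expected artifact appears on disk: problem.md written, then hypothesis.md written, then findings.md written.
7.4 User Experience
Start: Human writes problem.md (research question, baseline, success criteria) and runs the orchestrator.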
Autonomous run: The orchestrator runs Planner → Reviewer (design) → Executor → Reviewer (findings) → Extractor automatically, advancing through phases.
Human gates: At each gate, the orchestrator pauses and surfaces the artifact + AI review summaries. The human approves, rejects with feedback, or aborts.
Inspection: Human can inspect state.json, ledger.json, principles.json, and summary.md at any time between gates.
Control: Human can switch mechanism families, add manual overrides to principles.json, or adjust stagnation thresholds.
The human is not in the loop between gates. The system handles all intermediate steps.
7.5 Tool Access by Role
| Role | Read files | Write files | Shell commands | Human interaction |
|---|---|---|---|---|
| Planner | ✓ | hypothesis.md | - | - |
| Executor | ✓ | findings.md, results/ | ✓ (simulator, scripts) | - |
| Reviewer | ✓ | review-*.md | - | - |
| Extractor | ✓ | principles.json, summary.md | - | - |
| Human | all | problem.md, overrides | - | gates |
Observability
Every campaign has a unique run ID. All events are logged to runs/<run-id>/trace.jsonl — one line per LLM call, tool call, or state transition. A summary.json is auto-generated when the campaign reaches DONE: total cost, token counts, cost by state, and per-iteration stats.
Key separation of concerns:
Ledger (ledger.json): what was discovered (scientific content)
Trace (trace.jsonl): how it happened and what it cost
Summary (summary.json): rolled-up stats for the run, distinct from the investigation summary in summary.md
These are separate files. The ledger does not carry cost data.
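Rolling a trace up into summary statistics is a single pass over the JSONL lines. The event field names used below (`state`, `cost_usd`, `tokens`) are assumptions for illustration; the actual trace schema is not specified here.

```python
import json
from collections import defaultdict


def summarize_trace(lines):
    """Aggregate total cost, token count, and cost by state from trace events."""
    total_cost = 0.0
    total_tokens = 0
    cost_by_state = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        cost = float(event.get("cost_usd", 0.0))  # tool calls may carry no cost
        total_cost += cost
        total_tokens += int(event.get("tokens", 0))
        cost_by_state[event.get("state", "UNKNOWN")] += cost
    return {
        "total_cost_usd": total_cost,
        "total_tokens": total_tokens,
        "cost_by_state": dict(cost_by_state),
    }


trace = [
    '{"state": "DESIGN", "kind": "llm_call", "cost_usd": 0.5, "tokens": 1000}',
    '{"state": "RUN", "kind": "tool_call"}',
]
print(summarize_trace(trace)["total_tokens"])  # → 1000
```

Because the rollup reads only the trace, the ledger can stay free of cost data, as the separation of concerns above requires.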
Reproducibility
Nous cannot guarantee bit-for-bit reproducibility — LLM outputs are non-deterministic even with identical prompts. But it targets scientific reproducibility: two independent runs on the same system should reach the same conclusions.
This works because the LLM is not the source of truth — the experiments are. The LLM generates hypotheses and extracts principles, but a principle only enters the store if a deterministic experiment backs it. Different runs may produce different hypothesis wording, but if the system behavior is the same, the verified principles converge.
What is fully reproducible:
Experiments — same seed → same simulator output. Any experiment in the ledger can be re-run by anyone and produce identical numbers.
Process — the state machine, bundle structure, review protocol, and ledger schema are fixed. Anyone following the same process runs the same kind of investigation.
Prompts — every prompt sent to Claude is stored in the trace and can be re-run.
What is statistically reproducible:
Principles — run the campaign K times independently. Principles appearing in all K runs are robust; those appearing in only 1 of K are fragile and should be flagged.
Remaining risk: the LLM may consistently miss a region of the mechanism space across all runs. Multiple reviewer perspectives and human gates mitigate this but do not eliminate it.
Practical recommendation: for any published result, run the campaign at least 3 times and report which principles appeared in all runs.
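The K-run comparison can be sketched as a counting step. It assumes principles from different runs have already been matched to shared identifiers, which is the hard part in practice since wording differs between runs. The "partial" bucket for principles seen in some but not all runs is an addition made here for completeness.

```python
from collections import Counter


def classify_principles(runs):
    """Group principle ids by how many of the K independent runs found them.

    runs is a list of sets of principle ids, one set per campaign run.
    """
    k = len(runs)
    if k < 2:
        raise ValueError("need at least two independent runs to compare")
    counts = Counter(pid for run in runs for pid in set(run))
    report = {"robust": [], "partial": [], "fragile": []}
    for pid, n in sorted(counts.items()):
        if n == k:
            report["robust"].append(pid)   # appeared in all K runs
        elif n == 1:
            report["fragile"].append(pid)  # appeared in only one run
        else:
            report["partial"].append(pid)
    return report


runs = [{"P1", "P2"}, {"P1", "P3"}, {"P1", "P2"}]
print(classify_principles(runs))
```

For a published result, the "robust" list is what the recommendation above asks to report, with fragile principles flagged separately.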