Agentic Systems Experimentation and Discovery (ASED) #2

sriumcp · 2026-03-27T18:11:22Z

The Problem

Engineers constantly investigate software systems. Why does latency spike under this workload? What routing algorithm works best here? Where does the system break under load? What configuration maximizes throughput?

Current tools only cover parts of this. Conversational AI is broad but unstructured — insights don't compound across sessions. Automated search frameworks are structured but narrow — they find what scores highest, not why. Neither enables systematic, compounding, full-spectrum investigation.

What is ASED?

Agentic Systems Experimentation and Discovery (ASED) is the class of activities where an AI agent systematically investigates a software system — experimenting with it, understanding its behavior, and discovering things about it.

Use Case	What you're trying to do
Algorithm discovery	Find a better algorithm or heuristic
Performance optimization	Find configurations that optimize a target metric
Fault discovery	Uncover failure modes, bugs, and dangerous conditions
Regime identification	Find the boundaries where strategies break or shift
Mechanism understanding	Learn why things work, not just what works
Strategy evolution	Evolve system strategies through structured iteration

ASED means structured, scientific, compounding investigation — not ad-hoc chat.

Current Approaches and Their Limits

Manual investigation — broad, produces real understanding, but slow. Insights live in someone's head.

Conversational AI (Claude, ChatGPT, Copilot) — broad in scope, but unstructured. Sessions are ephemeral — no compounding, no hypothesis testing, no audit trail. The human is the process.

Evolutionary AI search (FunSearch, AlphaEvolve, SkyDiscover) — structured and reproducible, but narrow. Enables algorithm discovery only. The LLM is a code mutator, not a scientist. You learn what scores highest — not why.

Grid search, Bayesian optimization (BO), fuzzing — automated, but each covers only one slice: config search, parameter tuning, or fault finding, respectively.

	Narrow scope	Broad scope
Unstructured	Fuzzing, random testing	Manual investigation, Conversational AI
Structured & scientific	Grid search, BO, SkyDiscover	Nous

Nous

Nous is a framework that runs the scientific method on software systems using AI agents.

Two key properties make it work. First, hypothesis-driven experimentation: the agent forms a falsifiable claim, designs a controlled experiment to test it, and learns from the outcome either way. Refuted hypotheses are as valuable as confirmed ones — they reveal where the mental model was wrong. Second, compounding knowledge: principles extracted from iteration N constrain the design space of iteration N+1. The campaign gets smarter over time — not just better at configuring the target system, but better at predicting how it will behave.

Nous is domain-agnostic. It applies to any system with a simulator or testbed: LLM inference serving, database transaction scheduling, distributed load balancing, cloud resource allocation, and more.

Scope and Applicability

When Nous applies

Nous requires four preconditions. A system must have all of them.

Precondition	What it means	Why Nous needs it
Observable metrics	The system produces measurable outputs (latency, throughput, error rate, resource utilization, etc.)	Hypotheses require quantitative predictions — without metrics, claims cannot be falsified
Controllable policy space	There are knobs to turn — algorithms, configurations, scheduling policies, routing rules, resource limits	If nothing varies, there is nothing to investigate
Reproducible execution	A simulator, testbed, or staging environment where experiments can be repeated under controlled conditions with multiple seeds	Controlled repetition is required to separate signal from noise and isolate mechanisms
Decomposable mechanisms	System behavior arises from interacting components that can be reasoned about individually (e.g., routing + scheduling + caching) rather than a single opaque monolith	Hypothesis bundles ablate components — this requires that components can be isolated

One-line summary: Nous applies to systems where you can vary policies, measure outcomes, repeat experiments, and reason about mechanisms.

In-scope problem classes

The engineer is asking one of these questions:

Problem class	Question form	Example
Algorithm discovery	"What mechanism should the system use?"	Best routing signal combination for LLM inference
Performance optimization	"What settings maximize metric X under constraint Y?"	Config that minimizes P99 latency while maintaining throughput
Fault discovery	"Where does the system break or misbehave?"	KV-utilization scorer hurts under memory pressure
Regime identification	"When does strategy A stop working?"	Admission control has no effect below 50% utilization
Mechanism understanding	"Why does X happen?"	Why scheduling priority is zero-sum at saturation
Strategy evolution	"How should the system's strategy change over time?"	Evolving from single-signal to compound routing as load increases

These are examples, not an exhaustive list. Any systematic investigation of runtime behavior, performance, correctness, or policy effectiveness is in scope.

Out of scope

Out of scope	Why
Code synthesis — generating programs from scratch	Nous investigates existing systems. For automated code generation and search, use SkyDiscover or AlphaEvolve.
Formal verification — proving mathematical properties	Nous is empirical (experiment-based), not formal (proof-based). Empirical methods can find bugs but cannot guarantee absence of bugs.
One-shot optimization — "just give me the best config"	Nous builds understanding over iterations. For pure parameter search without understanding, Bayesian optimization is faster and sufficient.
Systems without a policy space	If there are no knobs to vary, there is nothing to investigate.
Non-reproducible environments	Without controlled repetition, hypotheses cannot be tested. (Partial exception: production A/B testing with sufficient traffic can substitute, at higher cost and lower control.)
Opaque monolithic systems	If mechanisms cannot be isolated, ablation is impossible and Nous degrades to black-box optimization — providing no understanding advantage over BO.

One-line framing: Nous is for investigating how a system behaves and why — not for building the system in the first place.

Is my system amenable to ASED?

Does your system have measurable metrics?
  └─ No  → Out of scope
  └─ Yes → Can you vary policies/algorithms/configs?
              └─ No  → Out of scope
              └─ Yes → Can you run controlled experiments (simulator, testbed, or A/B)?
                          └─ No  → Out of scope
                          └─ Yes → Can you reason about individual mechanisms?
                                      └─ No  → Nous degrades to black-box optimization
                                      └─ Yes → ✓ Nous applies
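
The decision tree above can be written as a short function. This is only a sketch: the `SystemProfile` fields and the returned labels are names chosen for illustration, not part of Nous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Answers to the four precondition questions (field names are illustrative)."""
    has_metrics: bool
    has_policy_space: bool
    has_controlled_experiments: bool
    has_decomposable_mechanisms: bool


def ased_verdict(profile: SystemProfile) -> str:
    """Walk the decision tree top to bottom and return the leaf that is reached."""
    if not profile.has_metrics:
        return "out of scope"
    if not profile.has_policy_space:
        return "out of scope"
    if not profile.has_controlled_experiments:
        return "out of scope"
    if not profile.has_decomposable_mechanisms:
        # Still usable, but without the understanding advantage over BO.
        return "degrades to black-box optimization"
    return "applies"
```

The checks are ordered exactly as in the tree, so the first missing precondition determines the verdict.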

What Nous Discovers

A single Nous campaign produces discoveries across multiple ASED use cases simultaneously. From the BLIS case study (LLM inference serving):

Discovery type	Example
Algorithm discovery	Optimal routing signal pair (prefix-affinity + queue-depth); admission control as a "third lever"
Fault discovery	KV-utilization scorer is counterproductive under memory pressure
Regime identification	Admission control effect vanishes at low load; PA:QD ratio safety boundary
Mechanism understanding	Why scheduling alone is zero-sum at saturation
Strategy evolution	Compound strategy that generalizes across 4 workload shapes

The Nous Process

6.1 The 5-Phase Loop

┌─────────────────────────────────────────────────────┐
│                                                     │
│   1. Frame        What are we investigating?        │
│      the          What's the baseline?              │
│      Problem      What does success look like?      │
│         │                                           │
│         ▼                                           │
│   2. Design       Form a falsifiable hypothesis.    │
│      Hypothesis   Decompose it into testable arms.  │
│      Bundle       Get it reviewed. Get approved.    │
│         │                                           │
│         ▼                                           │
│   3. Run          Implement. Run experiments.       │
│      Experiments  Analyze results. Document         │
│                   findings. Get findings reviewed.  │
│         │                                           │
│         ▼                                           │
│   4. Tune         If the mechanism works,           │
│      Parameters   optimize its parameters.          │
│         │                                           │
│         ▼                                           │
│   5. Extract      Insert new principles.            │
│      Principles   Update or prune existing ones.    │
│      & Iterate    Feed the store into next iter.    │
│         │                                           │
│         └──────────────────────────────► repeat     │
│                                                     │
└─────────────────────────────────────────────────────┘

Phase	Input	Output	Gate
Frame	Research question	`problem.md`, baseline	—
Design	Principles, investigation summary	Hypothesis bundle	AI Design Review + Human Approval
Run	Approved bundle	Experiment results, `findings.md`	AI Findings Review + Human Approval
Tune	Confirmed mechanism	Optimized parameters	— (skip if H-main refuted)
Extract	Findings + principle store	Updated principles, investigation summary	—

Hypothesis bundles decompose each candidate strategy into testable arms before any implementation:

Arm	Tests	Example
H-main	Does the mechanism work?	"Compound routing reduces critical TTFT P99 by >40%"
H-ablation	Which components matter?	"Removing admission control degrades by >15%"
H-super-additivity	Non-linear interaction?	"Compound effect > sum of individual effects"
H-control-negative	Where should effect vanish?	"At low load (<50% util), compound ≈ round-robin"
H-robustness	Generalization?	"Effect holds under bursty (Gamma) arrivals"

Each arm is a triple: (prediction, mechanism, diagnostic) — a quantitative claim, a causal explanation, and what to investigate if the prediction is wrong.

Bundle sizing — not every iteration needs all five arm types:

Iteration type	Required arms	Optional
New compound mechanism (≥2 components)	H-main, all H-ablation, H-super-additivity, H-control-negative	H-robustness
Component removal/simplification	H-main, H-control-negative, removal ablation	H-robustness
Single-component mechanism	H-main, H-control-negative	H-robustness
Parameter-only change	H-main only	—
Robustness sweep (post-confirmation)	H-robustness arms only	—
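
One way to picture the arm triple and the sizing table together is a small data structure plus a completeness check. This is a hedged sketch, not the Nous schema: the `Arm` class, the shorthand iteration-type keys, and `missing_arms` are assumptions made for illustration. It also simplifies "all H-ablation" to "at least one H-ablation arm is present".

```python
from dataclasses import dataclass

# Required arm kinds per iteration type, transcribed from the sizing table.
# The dictionary keys are shorthand labels invented for this sketch.
REQUIRED_ARMS = {
    "new_compound": {"H-main", "H-ablation", "H-super-additivity", "H-control-negative"},
    "component_removal": {"H-main", "H-control-negative", "H-ablation"},
    "single_component": {"H-main", "H-control-negative"},
    "parameter_only": {"H-main"},
    "robustness_sweep": {"H-robustness"},
}


@dataclass(frozen=True)
class Arm:
    kind: str         # e.g. "H-main", "H-ablation"
    prediction: str   # quantitative claim
    mechanism: str    # causal explanation
    diagnostic: str   # what to investigate if the prediction is wrong


def missing_arms(iteration_type: str, arms: list[Arm]) -> set[str]:
    """Return the required arm kinds that the bundle does not yet contain."""
    present = {arm.kind for arm in arms}
    return REQUIRED_ARMS[iteration_type] - present
```

A design reviewer or the orchestrator could run a check like this before the bundle reaches the Design gate; an empty result means the bundle is complete for its iteration type.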

Fast-fail rules — the orchestrator enforces these to avoid wasted work:

H-main refuted → skip remaining ablation/robustness arms, go directly to Extraction
Single dominant component (>80% of effect) → simplify strategy, drop minor components
H-control-negative fails → mechanism is confounded, redesign before continuing
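
The three fast-fail rules can be sketched as a single decision function. The function name, the returned labels, and the precedence between rules are assumptions; the text above only lists the rules. `component_shares` is a hypothetical mapping from component name to its fraction of the total effect, as estimated from ablation arms.

```python
def fast_fail_action(h_main_confirmed: bool,
                     control_negative_passed: bool,
                     component_shares: dict[str, float]) -> str:
    """Map arm outcomes to the orchestrator's next action."""
    if not h_main_confirmed:
        # Skip remaining ablation/robustness arms and go to Extraction.
        return "extract"
    if not control_negative_passed:
        # The effect did not vanish where it should: mechanism is confounded.
        return "redesign"
    if max(component_shares.values(), default=0.0) > 0.80:
        # One component carries more than 80% of the effect.
        return "simplify"
    return "continue"
```

Checking H-main first reflects the idea that no further arm is worth running once the main claim is refuted.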

Prediction error classification — when a prediction is wrong, the type of error guides what the Extractor does:

Error type	Meaning	Extractor action
Direction wrong	Fundamental misunderstanding of mechanism	Prune or heavily revise the principle
Magnitude wrong	Correct mechanism, inaccurate model of strength	Update principle with calibrated bounds
Regime wrong	Mechanism works under different conditions than predicted	Update principle with correct regime boundaries
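
A rough sketch of how the three error types might be told apart numerically follows. Everything here is an assumption for illustration: effects are signed numbers (for example, percent change in a metric), `rel_tol` is an arbitrary tolerance, and a regime error is recognized only when a measurement from another regime happens to be available.

```python
from typing import Optional


def classify_prediction_error(predicted: float,
                              observed: float,
                              observed_other_regime: Optional[float] = None,
                              rel_tol: float = 0.25) -> str:
    """Classify a prediction as correct, or as a direction/magnitude/regime error."""
    def close(value: float) -> bool:
        return abs(value - predicted) <= rel_tol * abs(predicted)

    if close(observed):
        return "correct"
    if observed_other_regime is not None and close(observed_other_regime):
        # The predicted effect shows up, but under different conditions.
        return "regime"
    if predicted * observed < 0:
        # Opposite sign: the mechanism was misunderstood.
        return "direction"
    # Same sign (or no effect), but the strength was mis-modelled.
    return "magnitude"
```

The result would then select the Extractor action in the table: prune or revise for a direction error, recalibrate bounds for a magnitude error, and move the regime boundary for a regime error.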

Two quality gates sit inside the loop. At each gate, AI reviewers run independently from multiple perspectives and must converge (no CRITICAL findings); the human then approves, having seen the AI review output. The Design gate catches flaws before implementation; the Findings gate catches analytical errors and bad causal claims before principles enter the store.

6.2 Principle Store (Insert / Update / Prune)

The principle store is the campaign's memory — a living knowledge base, not a log. Three operations:

Insert — add a new principle from a confirmed or refuted hypothesis
Update — refine an existing principle's scope, parameters, or confidence
Prune — mark a principle as superseded or refuted by new evidence

A store that drifts from reality constrains future iterations based on wrong beliefs — worse than no store at all. The Extraction phase is responsible for all three operations. The agent is explicitly prompted: "Which existing principles need to be refined or invalidated by these results?"
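
The three operations can be sketched as a small in-memory class. This is a minimal illustration under assumptions: the `Principle` fields and status values are invented here, and the actual layout of `principles.json` is not specified in this post. Pruning marks a principle rather than deleting it, which matches the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Principle:
    id: str
    statement: str
    status: str = "active"  # "active", "superseded", or "refuted"
    evidence: list[str] = field(default_factory=list)  # supporting iteration ids


class PrincipleStore:
    def __init__(self) -> None:
        self._items: dict[str, Principle] = {}

    def insert(self, principle: Principle) -> None:
        if principle.id in self._items:
            raise KeyError(f"{principle.id!r} already exists; use update()")
        self._items[principle.id] = principle

    def update(self, pid: str, statement: Optional[str] = None,
               evidence: Optional[str] = None) -> None:
        principle = self._items[pid]
        if statement is not None:
            principle.statement = statement
        if evidence is not None:
            principle.evidence.append(evidence)

    def prune(self, pid: str, status: str) -> None:
        if status not in {"superseded", "refuted"}:
            raise ValueError("status must be 'superseded' or 'refuted'")
        self._items[pid].status = status  # kept for the audit trail, not deleted

    def get(self, pid: str) -> Principle:
        return self._items[pid]

    def active(self) -> list[Principle]:
        return [p for p in self._items.values() if p.status == "active"]
```

Only `active()` principles would constrain the next Design phase; pruned ones stay inspectable.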

6.3 Investigation Summary (Bounded Working Memory)

After each Extraction, the agent produces a compact investigation summary — what has been investigated, what principles hold, what families are open vs. exhausted, what the most promising next directions are. This summary replaces the full ledger as the agent's working context in the next iteration.

The next iteration's Design prompt is seeded with: (research question, investigation summary, last experiment outcome) — not the full history. Context size stays O(summary) regardless of campaign depth. The full ledger is always on disk.
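
As a sketch of what seeding with a bounded summary could look like, the function below assembles the three inputs and refuses a summary that exceeds a budget. The function name, the section labels, and the character budget are assumptions for illustration; the post does not state how the bound is enforced.

```python
def build_design_seed(research_question: str,
                      summary: str,
                      last_outcome: str,
                      summary_budget_chars: int = 8000) -> str:
    """Assemble the Design-phase context from the three bounded inputs."""
    if len(summary) > summary_budget_chars:
        # The summary is meant to stay compact; the full history lives in the ledger.
        raise ValueError("investigation summary exceeds its budget; re-summarize first")
    sections = [
        "Research question:\n" + research_question,
        "Investigation summary:\n" + summary,
        "Last experiment outcome:\n" + last_outcome,
    ]
    return "\n\n".join(sections)
```

Because the ledger is never passed in, the prompt size depends only on the three arguments, not on how many iterations the campaign has run.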

6.4 Mechanism Families and Convergence

A campaign investigates one mechanism family at a time — a cluster of related system knobs (e.g., routing signals, scheduling priority, admission control). Families are investigated independently: each runs its own hypothesis loop, but discoveries in one family propagate to the others via the principle store.

Stagnation signal: The orchestrator tracks the count of consecutive iterations with no new principles extracted. This count is surfaced to the human as an informational signal — the human decides when a family is exhausted and when to switch or stop. The system informs; the human decides.
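
The stagnation count can be computed directly from ledger records. In this sketch the record field `principles_extracted` is an assumed name; the ledger schema is not given in this post.

```python
def stagnation_count(ledger_records: list[dict]) -> int:
    """Count trailing iterations that extracted no new principles."""
    count = 0
    for record in reversed(ledger_records):
        if record.get("principles_extracted"):
            break  # the most recent productive iteration ends the streak
        count += 1
    return count
```

The function only reports the number; whether a family is exhausted remains the human's call.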

6.5 Structured Ledger

One JSON record per completed iteration: hypothesis bundle, experiment outcomes, principles extracted, prediction accuracy. The ledger feeds into the next iteration's Design prompt (via the investigation summary). It is never modified — only appended. Full audit trail.
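
An append-only ledger is simple to sketch if each record is stored as one JSON object per line. That layout is an assumption made here so that appends never rewrite earlier records; the post only says "one JSON record per completed iteration". The function names are illustrative.

```python
import json
from pathlib import Path


def append_ledger_record(path: Path, record: dict) -> None:
    """Append one iteration record; earlier lines are never touched."""
    line = json.dumps(record, sort_keys=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")


def read_ledger(path: Path) -> list[dict]:
    """Load all records in the order they were appended."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A record would carry the four items listed above: the hypothesis bundle, experiment outcomes, principles extracted, and prediction accuracy.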

Agentic Implementation

7.1 Architecture Overview

┌──────────────────────────────────────────────────────────┐
│                   ORCHESTRATOR                            │
│         (shell script or Python — NOT an LLM)            │
│   Owns: state machine, file I/O, gate logic               │
│   Drives: phase transitions, Claude invocations           │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  ┌───────────┐  ┌───────────┐  ┌──────────┐  ┌────────┐ │
│  │  PLANNER  │  │ EXECUTOR  │  │ REVIEWER │  │EXTRACT.│ │
│  │ (Claude)  │  │ (Claude)  │  │ (Claude) │  │(Claude)│ │
│  └───────────┘  └───────────┘  └──────────┘  └────────┘ │
│                                                           │
├──────────────────────────────────────────────────────────┤
│  Files on disk:                                           │
│    state.json        — current phase, family, iteration   │
│    ledger.json       — one record per completed iteration │
│    principles.json   — the principle store                │
│    summary.md        — investigation summary (bounded)    │
│    runs/<iter>/      — per-iteration artifacts            │
└──────────────────────────────────────────────────────────┘

The orchestrator is a script, not an LLM. Claude does not decide when to transition states — the orchestrator does, based on file existence, exit codes, and review convergence checks. This keeps the state machine deterministic and auditable.
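
A review convergence check of the kind described can be a plain file scan. The sketch below assumes reviewers flag blocking issues on lines that begin with the word CRITICAL (optionally after a list bullet); that marker format, the function name, and the flat directory layout are assumptions, not something this post specifies.

```python
import re
from pathlib import Path

# A line such as "CRITICAL: ..." or "- CRITICAL: ..." counts as a blocking finding.
CRITICAL_RE = re.compile(r"^\s*(?:[-*]\s*)?CRITICAL\b", re.MULTILINE)


def reviews_converged(review_dir: Path, perspectives: list[str]) -> bool:
    """True only if every expected review exists and none reports a CRITICAL."""
    for perspective in perspectives:
        review = review_dir / f"review-{perspective}.md"
        if not review.exists():
            return False  # a reviewer has not finished (or failed)
        if CRITICAL_RE.search(review.read_text(encoding="utf-8")):
            return False
    return True
```

Because the check looks only at files on disk, the same inputs always give the same transition decision, which is the property the orchestrator relies on.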

7.2 Agent Roles

Role	Phase	Job	Tools
Planner	Frame, Design	Produce hypothesis bundle from principles + summary	Read files, write `hypothesis.md`
Executor	Run, Tune	Implement experiment, run simulator, analyze results	Read/write files, shell commands
Reviewer	Design Review, Findings Review	Review artifact from assigned perspective	Read files only, write `review-<perspective>.md`
Extractor	Extract	Update principle store (insert/update/prune), rewrite summary	Read/write `principles.json`, `summary.md`

Each role is a separate Claude invocation. Reviewers run in parallel (multiple perspectives simultaneously).

Invocation sketches:

# Planner (Design phase)
claude -p "You are a research planner. Design a hypothesis bundle for the next iteration.
Read: principles.json, summary.md. Write: runs/<iter>/hypothesis.md" --allowedTools Read,Write

# Executor (Run phase)
claude -p "You are an experiment executor. Implement and run the approved experiment.
Read: runs/<iter>/hypothesis.md. Write: runs/<iter>/findings.md" --allowedTools Read,Write,Bash

# Reviewer — multiple in parallel, one per perspective
claude -p "You are a design reviewer (perspective: statistical rigor). Review the hypothesis bundle.
Read: runs/<iter>/hypothesis.md, principles.json. Write: runs/<iter>/reviews/review-stats.md" --allowedTools Read,Write

# Extractor (Extract phase)
claude -p "You are a principle extractor. Update the principle store from these findings.
Read: runs/<iter>/findings.md, principles.json, summary.md.
Write: principles.json (insert/update/prune), summary.md (rewrite)" --allowedTools Read,Write
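
A hedged sketch of launching reviewers in parallel from a Python orchestrator follows. The `claude -p ... --allowedTools` form is copied from the invocation sketches above; the helper names, the perspective slugs, and the injectable `runner` parameter (useful for dry runs) are assumptions for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def reviewer_command(perspective: str, iteration: str) -> list[str]:
    """Build the argv for one design reviewer."""
    prompt = (
        f"You are a design reviewer (perspective: {perspective}). "
        "Review the hypothesis bundle.\n"
        f"Read: runs/{iteration}/hypothesis.md, principles.json. "
        f"Write: runs/{iteration}/reviews/review-{perspective}.md"
    )
    return ["claude", "-p", prompt, "--allowedTools", "Read,Write"]


def run_reviewers(perspectives: list[str], iteration: str,
                  runner=subprocess.run) -> dict[str, int]:
    """Run one reviewer per perspective concurrently; return exit codes."""
    with ThreadPoolExecutor(max_workers=max(1, len(perspectives))) as pool:
        futures = {
            p: pool.submit(runner, reviewer_command(p, iteration))
            for p in perspectives
        }
        return {p: future.result().returncode for p, future in futures.items()}
```

Passing the command as a list avoids shell quoting problems with multi-line prompts, and a non-zero exit code for any perspective can be treated as a failed review.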

7.3 State Machine and Transitions

From	To	Trigger	Handled by
INIT	FRAMING	Campaign started	Planner
FRAMING	DESIGN	`problem.md` written	Planner
DESIGN	DESIGN_REVIEW	`hypothesis.md` written	Reviewer ×N (parallel)
DESIGN_REVIEW	HUMAN_DESIGN_GATE	All reviews written, no CRITICAL	Human
HUMAN_DESIGN_GATE	DESIGN	Human rejects	Planner (revise)
HUMAN_DESIGN_GATE	RUNNING	Human approves	Executor
RUNNING	FINDINGS_REVIEW	`findings.md` written	Reviewer ×N (parallel)
FINDINGS_REVIEW	HUMAN_FINDINGS_GATE	All reviews written, no CRITICAL	Human
HUMAN_FINDINGS_GATE	RUNNING	Human rejects	Executor (re-run)
HUMAN_FINDINGS_GATE	TUNING	Human approves, H-main confirmed	Executor
HUMAN_FINDINGS_GATE	EXTRACTION	Human approves, H-main refuted	Extractor
TUNING	EXTRACTION	Tuning complete	Extractor
EXTRACTION	DESIGN	Principles updated, human continues	Planner (next iteration)
EXTRACTION	DONE	Human decides campaign is complete	—
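
The transition table can be encoded as a lookup keyed by (state, event), which keeps the state machine deterministic and easy to audit. The state names come from the table; the event labels and the `next_state` helper are shorthand invented for this sketch.

```python
TRANSITIONS = {
    ("INIT", "campaign_started"): "FRAMING",
    ("FRAMING", "problem_written"): "DESIGN",
    ("DESIGN", "hypothesis_written"): "DESIGN_REVIEW",
    ("DESIGN_REVIEW", "reviews_clean"): "HUMAN_DESIGN_GATE",
    ("HUMAN_DESIGN_GATE", "human_rejects"): "DESIGN",
    ("HUMAN_DESIGN_GATE", "human_approves"): "RUNNING",
    ("RUNNING", "findings_written"): "FINDINGS_REVIEW",
    ("FINDINGS_REVIEW", "reviews_clean"): "HUMAN_FINDINGS_GATE",
    ("HUMAN_FINDINGS_GATE", "human_rejects"): "RUNNING",
    ("HUMAN_FINDINGS_GATE", "h_main_confirmed"): "TUNING",
    ("HUMAN_FINDINGS_GATE", "h_main_refuted"): "EXTRACTION",
    ("TUNING", "tuning_complete"): "EXTRACTION",
    ("EXTRACTION", "human_continues"): "DESIGN",
    ("EXTRACTION", "human_done"): "DONE",
}


def next_state(state: str, event: str) -> str:
    """Return the successor state, rejecting any transition not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition from {state} on {event}") from None
```

Rejecting unknown pairs means a bug in the orchestrator surfaces as an error rather than as a silent jump to the wrong phase.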

Orchestrator loop (pseudocode):

# `state` is assumed to have been loaded from state.json (e.g. "DESIGN" once problem.md exists).
while state != "DONE":
    match state:
        case "DESIGN":
            run("claude -p '$PLANNER_PROMPT' --allowedTools Read,Write")
            state = "DESIGN_REVIEW"  # trigger: hypothesis.md written
        case "DESIGN_REVIEW":
            run_parallel([f"claude -p '$REVIEWER_PROMPT_{p}' --allowedTools Read,Write" for p in PERSPECTIVES])
            # Any CRITICAL finding sends the bundle back to the Planner.
            state = "HUMAN_DESIGN_GATE" if no_criticals() else "DESIGN"
        case "HUMAN_DESIGN_GATE":
            decision = prompt_human("Approve hypothesis bundle? (approve/reject)")
            state = "RUNNING" if decision == "approve" else "DESIGN"
        case "RUNNING":
            run("claude -p '$EXECUTOR_PROMPT' --allowedTools Read,Write,Bash")
            state = "FINDINGS_REVIEW"  # trigger: findings.md written
        case "FINDINGS_REVIEW":
            run_parallel([f"claude -p '$REVIEWER_PROMPT_{p}' --allowedTools Read,Write" for p in PERSPECTIVES])
            # Any CRITICAL finding sends the Executor back to re-run.
            state = "HUMAN_FINDINGS_GATE" if no_criticals() else "RUNNING"
        case "HUMAN_FINDINGS_GATE":
            decision = prompt_human("Approve findings? (approve/reject)")
            # reject -> RUNNING; approve -> TUNING if H-main confirmed, else EXTRACTION
            state = next_state_from_findings(decision)
        case "TUNING":
            # Tuning is an Executor invocation with a tuning-specific prompt.
            run("claude -p '$TUNER_PROMPT' --allowedTools Read,Write,Bash")
            state = "EXTRACTION"
        case "EXTRACTION":
            run("claude -p '$EXTRACTOR_PROMPT' --allowedTools Read,Write")
            decision = prompt_human("Continue campaign? (continue/done)")
            state = "DESIGN" if decision == "continue" else "DONE"
    update_state_json(state)

7.4 User Experience

Start: Human writes problem.md (research question, baseline, success criteria) and runs the orchestrator.
Autonomous run: The orchestrator runs Planner → Reviewer → Executor → Extractor automatically, advancing through phases.
Human gates: At each gate, the orchestrator pauses and surfaces the artifact + AI review summaries. The human approves, rejects with feedback, or aborts.
Inspection: Human can inspect state.json, ledger.json, principles.json, and summary.md at any time between gates.
Control: Human can switch mechanism families, add manual overrides to principles.json, or adjust stagnation thresholds.

The human is not in the loop between gates. The system handles all intermediate steps.

7.5 Tool Access by Role

Role	Read files	Write files	Shell commands	Human interaction
Planner	✓	`hypothesis.md`	—	—
Executor	✓	`findings.md`, `results/`	✓ (simulator, scripts)	—
Reviewer	✓	`review-*.md`	—	—
Extractor	✓	`principles.json`, `summary.md`	—	—
Human	all	`problem.md`, overrides	—	gates

Observability

Every campaign has a unique run ID. All events are logged to runs/<run-id>/trace.jsonl — one line per LLM call, tool call, or state transition. A summary.json is auto-generated when the campaign reaches DONE: total cost, token counts, cost by state, and per-iteration stats.

Key separation of concerns:

Ledger (ledger.json) — what was discovered (scientific content)
Trace (trace.jsonl) — how it happened and what it cost
Summary (summary.json) — rolled-up stats for the run; distinct from the investigation summary in summary.md

These are separate files. The ledger does not carry cost data.
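
Rolling a trace up into run-level stats might look like the sketch below. The event fields `state`, `tokens`, and `cost_usd`, and the keys of the returned dictionary, are assumed names; the actual trace and summary schemas are not given in this post.

```python
import json
from collections import defaultdict
from typing import Iterable


def summarize_trace(lines: Iterable[str]) -> dict:
    """Aggregate cost and token counts from trace events (one JSON object per line)."""
    cost_by_state: dict[str, float] = defaultdict(float)
    total_tokens = 0
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        # Tool calls and state transitions may carry no cost or token fields.
        cost_by_state[event.get("state", "UNKNOWN")] += event.get("cost_usd", 0.0)
        total_tokens += event.get("tokens", 0)
    return {
        "total_cost_usd": sum(cost_by_state.values()),
        "total_tokens": total_tokens,
        "cost_by_state": dict(cost_by_state),
    }
```

Since the function reads only the trace, the ledger never needs to carry cost data.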

Reproducibility

Nous cannot guarantee bit-for-bit reproducibility — LLM outputs are non-deterministic even with identical prompts. But it targets scientific reproducibility: two independent runs on the same system should reach the same conclusions.

This works because the LLM is not the source of truth — the experiments are. The LLM generates hypotheses and extracts principles, but a principle only enters the store if a deterministic experiment backs it. Different runs may produce different hypothesis wording, but if the system behavior is the same, the verified principles converge.

What is fully reproducible:

Experiments — same seed → same simulator output. Any experiment in the ledger can be re-run by anyone and produce identical numbers.
Process — the state machine, bundle structure, review protocol, and ledger schema are fixed. Anyone following the same process runs the same kind of investigation.
Prompts — every prompt sent to Claude is stored in the trace and can be re-run.

What is statistically reproducible:

Principles — run the campaign K times independently. Principles appearing in all K runs are robust; those appearing in only 1 of K are fragile and should be flagged.

Remaining risk: the LLM may consistently miss a region of the mechanism space across all runs. Multiple reviewer perspectives and human gates mitigate this but do not eliminate it.

Practical recommendation: for any published result, run the campaign at least 3 times and report which principles appeared in all runs.
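
The K-run comparison reduces to counting how many runs each principle appears in. This sketch assumes principles from different runs have already been matched to shared identifiers, which is a real difficulty since wording differs between runs; the function name and the "partial" bucket are additions for illustration.

```python
from collections import Counter


def principle_robustness(runs: list[set[str]]) -> dict[str, list[str]]:
    """Bucket principle ids by how many of the K independent runs found them."""
    if not runs:
        raise ValueError("need at least one run")
    k = len(runs)
    counts = Counter(pid for run in runs for pid in run)
    return {
        "robust": sorted(p for p, c in counts.items() if c == k),
        "partial": sorted(p for p, c in counts.items() if 1 < c < k),
        "fragile": sorted(p for p, c in counts.items() if c == 1 and k > 1),
    }
```

For a three-run campaign, the "robust" bucket is the set to report, and the "fragile" bucket is the set to flag.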
