Agentic Strategy Evolution: a three-loop methodology for optimizing multi-layer policy spaces #1

sriumcp · 2026-03-27T15:54:13Z

sriumcp
Mar 27, 2026
Maintainer

Status: Draft for review — feedback welcome on structure, framing, and open questions.
This methodology was developed for LLM inference serving but is domain-agnostic — the three-loop structure applies to any complex system with multiple interacting policy layers and a simulation-based evaluation harness.

The Problem This Solves

In systems with multiple interacting policy layers (routing, scheduling, memory management, admission control), the optimal configuration cannot be derived analytically. Interactions produce non-obvious emergent behaviors: super-additive effects (two mechanisms together outperform the sum of their individual gains), signal cancellation (adding a seemingly relevant signal makes things worse), and regime-dependent dominance (the right answer changes with load).

Classical approaches have blind spots:

Approach	What it misses
Grid/random search	Finds what works, not why it works or where it breaks
Bayesian optimization	Tunes parameters of a given mechanism; doesn't discover which mechanisms are worth tuning
Reinforcement learning	High fitness ≠ understanding; can overfit to the evaluator or workload
LLM world models (K-Search style)	Implicit, learned value estimates; no structured falsifiability; no "where should this fail?"

The target is not a faster artifact. The target is a validated understanding of which mechanisms work, why they work, where they break, and how to transfer that understanding to real systems.

Foundation: Strategy Evolution (the Original Single-Loop)

Strategy Evolution is a structured iterative search methodology. The central idea: a strategy is a hypothesis bundle. Every candidate mechanism is formulated as a set of testable, falsifiable predictions — designed before any code is written. Prediction errors, not just fitness scores, are the primary signal for learning.

The Hypothesis Bundle

When a candidate strategy is selected, it is decomposed into a multi-arm hypothesis bundle:

Arm	What it tests	Purpose
H-main	Mechanism's predicted effect + causal explanation	Does the strategy work, and why?
H-ablation-{component}	Each component's individual contribution	Which parts matter? Are any redundant?
H-super-additivity	Whether compound effect exceeds sum of parts	Do components interact?
H-control-negative	Where the effect should vanish	Confirms mechanism specificity
H-robustness	Generalization across workloads, resources, scale	Where does the strategy break?

Each arm includes a diagnostic clause — "if this fails, it indicates X" — that directs investigation when predictions don't match outcomes. Ablation hypotheses are designed before code is written, preventing confirmation bias: you predict each component's contribution before seeing whether the compound strategy works.

Pre-commit ablation is the key discipline. If you can't articulate what removing a component should do, you don't understand the mechanism well enough to implement it.
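A minimal sketch of how a hypothesis bundle could be represented so that missing arms are caught before implementation. The class names, field names, and the example predictions are illustrative assumptions, not part of the methodology's specification or its reported results.

```python
from dataclasses import dataclass, field

# Arm kinds taken from the table above; ablation arms are matched by prefix.
REQUIRED_ARM_KINDS = ("H-main", "H-ablation", "H-super-additivity",
                      "H-control-negative", "H-robustness")


@dataclass
class Arm:
    """One arm of a bundle, written down before any code is run."""
    name: str                # e.g. "H-main" or "H-ablation-slo-gate"
    prediction: str          # the falsifiable claim
    predicted_effect: float  # predicted relative change (-0.30 = 30% reduction)
    diagnostic: str          # "if this fails, it indicates X"


@dataclass
class HypothesisBundle:
    strategy: str
    arms: list[Arm] = field(default_factory=list)

    def missing_arm_kinds(self) -> list[str]:
        """Return the required arm kinds this bundle does not cover yet."""
        names = [arm.name for arm in self.arms]
        return [kind for kind in REQUIRED_ARM_KINDS
                if not any(name.startswith(kind) for name in names)]


def is_super_additive(compound_gain: float, component_gains: list[float]) -> bool:
    """H-super-additivity holds when the compound gain exceeds the sum of parts."""
    return compound_gain > sum(component_gains)


# Hypothetical bundle with made-up predictions, for illustration only.
bundle = HypothesisBundle(
    strategy="slo-gated-admission",
    arms=[
        Arm("H-main", "Critical TTFT P99 drops under overload", -0.30,
            "If this fails, admission overhead may exceed the queueing benefit"),
        Arm("H-ablation-slo-gate", "Removing the SLO gate loses most of the gain",
            -0.05, "If this fails, the gate is redundant with another component"),
    ],
)
print(bundle.missing_arm_kinds())
```

A bundle that still reports missing arm kinds is not ready for the design review gate, which is one way to make the pre-commit discipline mechanical.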

The Five-Phase Loop

Phase 1: Problem Framing
  → Write problem.md: baseline, workload, quantitative success criteria, constraints,
    prior knowledge inventory. Design workload to prevent metric gaming.

Phase 2: Hypothesis Bundle Design
  → Generate 2–3 candidate strategies, each self-critiqued and reviewed by multiple
    independent judges (human or LLM). Select winner by consensus.
  → Decompose winner into hypothesis bundle (H-main, ablations, controls, robustness).
  → Convergence-gated Design Review. Human approval gate — hard stop before implementation.

Phase 3: Implement and Verify
  → Implement strategy code + experiment code for all arms.
  → Code Review before running experiments.
  → Execute all arms across 3+ seeds. Compare predictions to outcomes arm-by-arm.
  → Document FINDINGS.md. 10-perspective FINDINGS Review.
  → Record in ledger (one row per iteration, with prediction accuracy column).

Phase 4: Bayesian Parameter Optimization
  → For confirmed mechanisms only: Gaussian process over parameter space.
  → Separates mechanism design (human creativity) from parameter tuning (machine search).
  → Every confirmed strategy gets optimized, so comparisons are fair.

Phase 5: Principle Extraction and Iteration
  → Distill numbered principles from confirmed AND refuted predictions.
  → Refuted predictions are the most valuable: discrepancy reveals something not understood.
  → Principles become hard constraints on subsequent iterations.
  → Stop when consecutive iterations produce null/marginal results.

The ledger is the single source of truth: one row per iteration, never deleted. Failed approaches are as valuable as successes — the ledger makes the full exploration path auditable and prevents re-exploring failed mechanisms.
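The ledger and the Phase 5 stopping rule can be sketched as an append-only structure. This is an in-memory illustration under stated assumptions: the column names, the 1% "marginal" threshold, and the patience of two iterations are hypothetical choices, since the text does not fix them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerRow:
    iteration: int
    strategy: str
    gain: float                 # relative improvement over baseline (0.05 = 5%)
    prediction_accuracy: float  # fraction of arms whose prediction held


class Ledger:
    """Append-only record: rows can be added and read, never edited or deleted."""

    def __init__(self) -> None:
        self._rows: list[LedgerRow] = []

    def append(self, row: LedgerRow) -> None:
        self._rows.append(row)

    @property
    def rows(self) -> tuple[LedgerRow, ...]:
        # A tuple of frozen rows, so callers cannot mutate history.
        return tuple(self._rows)


def prediction_accuracy(prediction_held: dict[str, bool]) -> float:
    """Fraction of arms whose pre-registered prediction matched the outcome."""
    return sum(prediction_held.values()) / len(prediction_held)


def should_stop(ledger: Ledger, min_gain: float = 0.01, patience: int = 2) -> bool:
    """Stop when the last `patience` iterations all produced marginal gains."""
    recent = ledger.rows[-patience:]
    return len(recent) == patience and all(abs(r.gain) < min_gain for r in recent)


# Made-up gains, for illustration only.
ledger = Ledger()
for i, gain in enumerate([0.12, 0.004, -0.002], start=1):
    ledger.append(LedgerRow(i, f"strategy-{i}", gain, prediction_accuracy(
        {"H-main": gain > 0.01, "H-control-negative": True})))
print(should_stop(ledger))
```

Keeping refuted iterations in the same table as confirmed ones is what lets the prediction accuracy column be read as a trend over the whole exploration.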

Example results from two parallel tracks on LLM inference serving:

Track	Iterations	Best result	Key discovery
Scheduling	11	−73.7% critical TTFT P99	Priority is zero-sum; admission is non-zero-sum
Routing	22	341x lower TTFT P99 (P/D disaggregation)	KV-utilization scorer counterproductive under pressure

Both tracks independently converged on SLO-gated admission control as the breakthrough "third lever."

Three Limitations of the Original Design

The original Strategy Evolution loop has three structural gaps:

1. No frontier management. The original loop picks one candidate per iteration and commits to it. This is greedy: if two mechanisms are both promising but one gets selected first, the other may never be explored. Re-exploration happens accidentally rather than systematically.

2. No sim-to-real discipline. The original loop stays entirely in simulation. Sim-to-real transfer is informal and opportunistic: there is no principled answer to "when should we run a real experiment, and what kind?" Real GPU-hours are scarce; their use should be explicit and cost-aware.

3. Simulator treated as fixed. The simulator is used as an evaluation harness but its fidelity is never systematically improved. Over time, the simulator's blind spots become the methodology's blind spots.

Agentic Strategy Evolution: Three-Loop Architecture

The extended methodology addresses these limitations by adding two outer loops around the original inner loop:

┌─────────────────────────────────────────────────────────────────────┐
│                         INNER LOOP                                   │
│                  (simulation, fast, cheap)                           │
│                                                                      │
│  Hypothesis Frontier  →  Bundle Design  →  Sim Evaluation           │
│        ↑                                        │                   │
│   Principles +         ←─────────────────  Prediction Error         │
│   Frontier Update                              Analysis              │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Promote candidates
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         OUTER LOOP                                   │
│              (real system, expensive, VoI-governed)                  │
│                                                                      │
│  VoI Selector  →  Real Experiment  →  Transfer Analysis             │
│       ↑         (Type A/B/C)               │                        │
│  Coverage + Cost   ←───────────────  Discrepancy Attribution        │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Simulator update signals
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   SIMULATOR EVOLUTION LOOP                           │
│           (structural + parametric + data + fidelity map)           │
│                                                                      │
│  Structure Updates  │  Coefficient Fitting  │  Fidelity Map         │
│  (physics / logic)  │  (model parameters)   │  (where to trust)    │
└─────────────────────────────────────────────────────────────────────┘
                               │ Updated simulator + trust map
                               ▼
                    ┌──────────────────────┐
                    │   Knowledge Stores   │
                    │  • Mechanism prncpls │
                    │  • Fidelity prncpls  │
                    │  • Transfer ledger   │
                    │  • Frontier archive  │
                    └──────────┬───────────┘
                               │ Constrains
                               ▼
                         (back to inner loop)

Inner Loop: Hypothesis Frontier Search

The inner loop is the original five-phase Strategy Evolution cycle, extended with a frontier manager and a verifier gate.

The Hypothesis Frontier

Instead of generating one candidate per iteration and immediately committing, maintain a scored frontier of partially-specified hypothesis bundles:

frontier_score(bundle) =
    expected_value        # predicted performance gain if confirmed
  + uncertainty_bonus     # prefer bundles with high epistemic value
  + novelty_bonus         # prefer mechanistically unexplored regions
  - eval_cost             # simulator budget required to evaluate
  - principle_violation   # penalty for contradicting known principles

The highest-scoring bundle is selected for the current iteration. The frontier is expanded by generating new candidates, re-scored after each iteration, and pruned whenever a bundle is dominated or refuted.
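The scoring and selection step above can be sketched as follows. The sketch assumes all five terms are already expressed in comparable units with unit weights, which the text leaves open (see the open question on frontier scoring calibration); the class name, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FrontierEntry:
    name: str
    expected_value: float       # predicted performance gain if confirmed
    uncertainty_bonus: float    # epistemic value of evaluating this bundle
    novelty_bonus: float        # reward for mechanistically unexplored regions
    eval_cost: float            # simulator budget required
    principle_violation: float  # penalty for contradicting known principles
    refuted: bool = False       # refuted bundles are pruned from selection


def frontier_score(b: FrontierEntry) -> float:
    """Direct transcription of the frontier_score formula, with unit weights."""
    return (b.expected_value + b.uncertainty_bonus + b.novelty_bonus
            - b.eval_cost - b.principle_violation)


def select_next(frontier: list[FrontierEntry]) -> FrontierEntry:
    """Pick the highest-scoring bundle among those not yet refuted."""
    live = [b for b in frontier if not b.refuted]
    return max(live, key=frontier_score)


# Made-up scores, for illustration only.
frontier = [
    FrontierEntry("admission-gate", 0.30, 0.10, 0.05, 0.10, 0.0),
    FrontierEntry("prefix-affinity", 0.20, 0.25, 0.20, 0.10, 0.0),
    FrontierEntry("kv-util-scorer", 0.50, 0.10, 0.10, 0.10, 0.0, refuted=True),
]
print(select_next(frontier).name)  # prefix-affinity
```

In this example the bundle with the lower expected value wins because its uncertainty and novelty bonuses are larger, which is the behavior that distinguishes a frontier from greedy selection.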

Why this matters: Under sequential commitment, an early success pulls later iterations toward variations of the same mechanism. A frontier explicitly tracks what has been tried and keeps the search from collapsing prematurely onto a narrow mechanism family. That collapse is a failure mode called super-additivity blindness, where one family of good mechanisms eclipses a different family that is also good but unexplored.

Diversity Preservation

Maintain a diversity archive indexed by:

Mechanistic family — e.g., admission-heavy, queue-depth-dominant, prefix-affinity
Regime behavior — e.g., works only under memory pressure, works at saturation, regime-independent
Performance frontier — Pareto-optimal bundles across competing metric tradeoffs

When selecting the next bundle, prefer nodes that are both high-scoring and mechanistically distinct from recently evaluated bundles.
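One simple way to express "high-scoring and mechanistically distinct" is to subtract a penalty each time a candidate's family appears among recently evaluated bundles. This repeat-family penalty is an assumption made for illustration; the text does not prescribe a particular diversity rule, and the penalty value is arbitrary.

```python
def select_diverse(candidates, recent_families, repeat_penalty=0.2):
    """Pick the candidate with the best score after penalizing repeated families.

    candidates: list of (name, mechanistic_family, frontier_score) tuples.
    recent_families: families of the most recently evaluated bundles.
    """
    def adjusted(candidate):
        _name, family, score = candidate
        # Each recent evaluation of the same family lowers the effective score.
        return score - repeat_penalty * recent_families.count(family)

    return max(candidates, key=adjusted)


# Made-up scores, for illustration only.
candidates = [
    ("gate-v2", "admission-heavy", 0.6),
    ("affinity-v1", "prefix-affinity", 0.5),
]
recent = ["admission-heavy", "admission-heavy"]
print(select_diverse(candidates, recent)[0])  # affinity-v1
```

With an empty history the same call returns the raw top scorer, so the penalty only matters once the search starts concentrating on one family.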

The Verifier Gate

Before committing to full simulation, run lightweight verifiers on each candidate bundle:

Verifier	What it checks
Bundle consistency	All arms (H-main, ablations, controls) are mutually coherent
Principle violations	Bundle doesn't contradict established principles without justification
Control validity	Negative-control conditions are achievable
Plausibility	Predicted effect sizes are within physically plausible range
Analyzer sanity	Analysis script can be written before running experiments

Bundles that fail a verifier are repaired or pruned before expensive simulation runs.
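The gate can be sketched as a table of cheap predicate functions run before any simulation. The three checks below are simplified stand-ins for three of the five verifiers in the table, and the dictionary keys are hypothetical; real verifiers would inspect the full bundle.

```python
from typing import Callable

Bundle = dict  # hypothetical bundle representation for this sketch
Verifier = Callable[[Bundle], bool]


def bundle_consistency(bundle: Bundle) -> bool:
    """Stand-in check: the bundle has an H-main arm and at least one ablation."""
    arms = bundle.get("arms", [])
    return "H-main" in arms and any(a.startswith("H-ablation") for a in arms)


def control_validity(bundle: Bundle) -> bool:
    """Stand-in check: a negative-control condition has been specified."""
    return bool(bundle.get("negative_control_condition"))


def plausibility(bundle: Bundle) -> bool:
    """Stand-in check: predicted effect size is within a stated bound."""
    bound = bundle.get("max_plausible_effect", 1.0)
    return abs(bundle.get("predicted_effect", 0.0)) <= bound


VERIFIERS: dict[str, Verifier] = {
    "bundle_consistency": bundle_consistency,
    "control_validity": control_validity,
    "plausibility": plausibility,
}


def failed_verifiers(bundle: Bundle,
                     verifiers: dict[str, Verifier] = VERIFIERS) -> list[str]:
    """Return the names of verifiers the bundle fails (empty list = pass)."""
    return [name for name, check in verifiers.items() if not check(bundle)]


# Hypothetical bundles, for illustration only.
good = {"arms": ["H-main", "H-ablation-gate"], "predicted_effect": -0.3,
        "negative_control_condition": "load below saturation"}
bad = {"arms": ["H-main", "H-ablation-gate"], "predicted_effect": 3.0}
print(failed_verifiers(good))  # []
print(failed_verifiers(bad))   # ['control_validity', 'plausibility']
```

Returning the list of failed checks rather than a single boolean gives the repair step something concrete to act on.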

Outer Loop: VoI-Governed Real-World Experiment Selection

Real experiments (GPU-hours) are scarce. The outer loop makes their use principled by selecting the experiment with the highest expected research value per unit cost.

The ROI Formula

ROI(e) = E[VoI(e)] / Cost(e)

VoI(e) = w_m · ΔU_mechanism    # reduces uncertainty about whether a promoted candidate works
       + w_s · ΔU_simulator    # helps calibrate or repair the simulator
       + w_r · ΔU_ranking      # helps rank frontier candidates correctly
       + w_b · ΔU_boundary     # identifies where a mechanism starts helping or failing
       + w_d · ΔU_deployment   # reduces uncertainty for a likely production choice

Cost(e) = gpu_hours
        + λ · eng_hours        # engineering setup and analysis time
        + μ · calendar_delay   # queueing delay for scarce cluster access

This reframes real experimentation from "validate the current best candidate" to "choose the most informative experiment available."
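As a worked illustration of the formula, the sketch below computes ROI for two candidate experiments and picks the better one. All numbers, the weights, and the values of λ and μ are made up for the example; how to estimate each ΔU term in practice is not specified by the text.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    delta_u: dict[str, float]  # expected uncertainty reduction per VoI term
    gpu_hours: float
    eng_hours: float
    calendar_delay: float      # in whatever unit mu is calibrated for


def voi(e: Experiment, weights: dict[str, float]) -> float:
    """Weighted sum of the five uncertainty-reduction terms (missing terms = 0)."""
    return sum(w * e.delta_u.get(term, 0.0) for term, w in weights.items())


def cost(e: Experiment, lam: float, mu: float) -> float:
    """gpu_hours + lambda * eng_hours + mu * calendar_delay."""
    return e.gpu_hours + lam * e.eng_hours + mu * e.calendar_delay


def roi(e: Experiment, weights: dict[str, float], lam: float, mu: float) -> float:
    return voi(e, weights) / cost(e, lam, mu)


# Unit weights w_m, w_s, w_r, w_b, w_d (an assumption for this example).
WEIGHTS = {"mechanism": 1.0, "simulator": 1.0, "ranking": 1.0,
           "boundary": 1.0, "deployment": 1.0}

transfer_run = Experiment("type-A transfer run",
                          {"mechanism": 0.6, "deployment": 0.2},
                          gpu_hours=40, eng_hours=8, calendar_delay=2)
probe = Experiment("type-B calibration probe",
                   {"simulator": 0.5, "ranking": 0.4},
                   gpu_hours=10, eng_hours=4, calendar_delay=1)

best = max([transfer_run, probe], key=lambda e: roi(e, WEIGHTS, lam=0.5, mu=1.0))
print(best.name)  # type-B calibration probe
```

Here the two experiments have similar VoI (0.8 versus 0.9), so the roughly 3.5x difference in cost (46 versus 13) decides the ranking.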

Three Experiment Types

Type	What it tests	When to prefer
A — Transfer run	Does the simulated win transfer end-to-end to reality?	Uncertainty is about whether a promoted mechanism survives real-system effects
B — Calibration probe	Is a specific simulator submodel accurate?	Uncertainty is concentrated in one shared submodel affecting many candidates
C — Coverage experiment	Does the simulator behave correctly in a new regime?	A workload family or concurrency band is uncharacterized

Why type-B probes often have the best amortized value: A single type-A run informs one promoted policy. A type-B probe that resolves one shared simulator uncertainty (e.g., decode-phase overhead model) can improve rank-order prediction for dozens of future candidates. When simulator uncertainty is globally shared across many frontier candidates, probes have better amortized ROI.

Operating rule: Prefer type-B when uncertainty is in a shared submodel; type-A when uncertainty is about whether a mechanism survives end-to-end; type-C when the primary gap is regime coverage.
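The amortization argument and the operating rule can be written down in a few lines. This is a sketch under stated assumptions: the uncertainty categories are given as hypothetical scalar estimates, and the example numbers are invented to show the arithmetic, not measured.

```python
def amortized_value(value_per_candidate: float, candidates_informed: int,
                    cost: float) -> float:
    """Research value per unit cost, summed over every candidate the run informs."""
    return value_per_candidate * candidates_informed / cost


# Operating rule: map the dominant source of uncertainty to an experiment type.
RULE = {
    "shared_submodel": "B",      # calibration probe
    "end_to_end_survival": "A",  # transfer run
    "regime_coverage": "C",      # coverage experiment
}


def preferred_type(uncertainty: dict[str, float]) -> str:
    """Return the experiment type for the largest uncertainty estimate."""
    dominant = max(uncertainty, key=uncertainty.get)
    return RULE[dominant]


# A type-A run informs one promoted policy; a type-B probe gives each of many
# candidates a smaller benefit, but at lower cost (made-up numbers).
type_a = amortized_value(value_per_candidate=1.0, candidates_informed=1, cost=40.0)
type_b = amortized_value(value_per_candidate=0.3, candidates_informed=24, cost=10.0)
print(type_b > type_a)

print(preferred_type({"shared_submodel": 0.6,
                      "end_to_end_survival": 0.3,
                      "regime_coverage": 0.1}))
```

The rule is deliberately coarse; in the full method the choice would come out of the ROI comparison, with this mapping serving as a default when the VoI terms are hard to estimate.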

The Transfer Bundle

For every promoted candidate, produce a transfer bundle — the outer-loop analogue of a hypothesis bundle:

mechanism_claim: "SLO-gated admission reduces critical latency by preventing
                  low-value work from saturating service capacity."

transfer_claim: "Improvement direction transfers; absolute magnitude may
                 shrink under real backpressure."

fidelity_expectation:
  rank_order: high         # simulator should correctly rank this above baseline
  direction: high          # improvement direction should hold
  magnitude: medium        # absolute latency values may differ
  boundary: medium         # load threshold where gains appear may shift

simulator_blind_spots:
  - network contention between service components
  - burst amplification from chunked processing
  - cache fragmentation under concurrent sessions

diagnostic_mismatch_clause: "If simulator shows gain but real system does not,
  inspect admission-delay overhead and per-token burst amplification."

data_capture_plan:
  required_observables:
    - per-step latency breakdown
    - queue depth evolution over time
    - admission accept/reject rates
  calibration_targets:
    - service time coefficient
    - admission overhead constant

Dual-Purpose Real Runs

Every outer-loop run should simultaneously:

Validate the promoted mechanism
Test simulator fidelity hypotheses
Collect data for coefficient fitting and workload modeling

This avoids paying for expensive experiment time twice.

Fidelity Hypothesis Classes

In addition to mechanism hypotheses, define fidelity hypotheses that explicitly test what the simulator gets right:

Arm	What it tests
H-fidelity-rank	Simulator correctly ranks top-k candidates
H-fidelity-direction	Simulator preserves improvement direction over baseline
H-fidelity-boundary	Simulator correctly predicts the regime where gains appear or vanish
H-fidelity-gap	Simulator overestimates or underestimates magnitude by no more than X%

Every serious outer-loop campaign should test both a mechanism hypothesis and the corresponding fidelity hypothesis simultaneously.
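Three of the fidelity arms reduce to simple comparisons between simulated and real measurements. The sketch below assumes gains are expressed so that higher is better, that the real gain is nonzero, and that rank fidelity means exact agreement on the ordered top-k; each of these is an illustrative choice, and the numbers are invented.

```python
def rank_fidelity(sim: dict[str, float], real: dict[str, float], k: int) -> bool:
    """H-fidelity-rank: simulator and real system agree on the ordered top-k."""
    def top_k(scores: dict[str, float]) -> list[str]:
        return sorted(scores, key=scores.get, reverse=True)[:k]

    return top_k(sim) == top_k(real)


def direction_fidelity(sim_gain: float, real_gain: float) -> bool:
    """H-fidelity-direction: both gains over baseline have the same sign."""
    return sim_gain * real_gain > 0


def gap_fidelity(sim_gain: float, real_gain: float, max_gap_pct: float) -> bool:
    """H-fidelity-gap: simulated magnitude is within X% of the real magnitude."""
    gap_pct = abs(sim_gain - real_gain) / abs(real_gain) * 100
    return gap_pct <= max_gap_pct


# Made-up gains for three hypothetical candidates.
sim = {"a": 0.40, "b": 0.30, "c": 0.10}
real = {"a": 0.25, "b": 0.20, "c": 0.22}

print(rank_fidelity(sim, real, k=1))   # True: both rank "a" first
print(rank_fidelity(sim, real, k=2))   # False: sim says a, b; real says a, c
print(direction_fidelity(0.40, 0.25))  # True
print(gap_fidelity(0.40, 0.25, max_gap_pct=25))  # False: the gap is about 60%
```

The example shows why the arms are tested separately: the simulator gets the winner and the direction right while missing both the runner-up and the magnitude.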

Simulator Evolution Loop

The simulator is not a fixed evaluation harness — it is a scientific instrument that co-evolves with the mechanisms it is used to study.

The Key Reframing

Stop asking: "Is the simulator accurate?"

Start asking: "Where is the simulator reliable enough for which decisions?"

A simulator does not need to be perfect. It needs to be trustworthy for specific decisions:

Inner-loop pruning requires rank-order fidelity
Publication claims require directional fidelity
Deployment recommendations may require regime-boundary fidelity
Absolute SLO targets require magnitude fidelity

The fidelity map records these distinctions explicitly.
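A fidelity map can be sketched as a lookup from regime to the kinds of fidelity that have been validated there, combined with the decision-to-fidelity requirements listed above. The regime names and the nested-dictionary layout are assumptions for illustration, and the mapping treats the "may require" entry for deployment as a hard requirement.

```python
# Fidelity kind each decision type depends on, taken from the list above.
REQUIRED_FIDELITY = {
    "inner_loop_pruning": "rank_order",
    "publication_claim": "direction",
    "deployment_recommendation": "boundary",
    "absolute_slo_target": "magnitude",
}


def can_trust(fidelity_map: dict[str, dict[str, bool]],
              regime: str, decision: str) -> bool:
    """True if the fidelity this decision needs has been validated in this regime."""
    needed = REQUIRED_FIDELITY[decision]
    # Unknown regimes and unvalidated fidelity kinds default to "do not trust".
    return fidelity_map.get(regime, {}).get(needed, False)


# Hypothetical trust map accumulated from outer-loop validation.
fidelity_map = {
    "moderate_load": {"rank_order": True, "direction": True},
    "saturation": {"rank_order": True},
}

print(can_trust(fidelity_map, "moderate_load", "inner_loop_pruning"))  # True
print(can_trust(fidelity_map, "saturation", "publication_claim"))      # False
print(can_trust(fidelity_map, "bursty_arrivals", "inner_loop_pruning"))  # False
```

Defaulting to distrust means a regime that has never been characterized shows up as a gap, which is the condition that motivates a type-C coverage experiment.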

Five Update Types

Layer	What changes	When to update
Structural model	Physics, logic, causal mechanisms — queueing disciplines, batching, memory eviction, communication	Systematic directional mismatch; wrong regime boundary; missing interaction effect
Parametric model	Fitted coefficients — step-time parameters, overhead constants, latency multipliers	Correct direction but wrong magnitude; consistent bias
Data model	Workload distributions — arrival processes, token length distributions, burstiness, cache reuse patterns	Simulator works on synthetic workloads but fails on real traces
Observability	Traces, counters, per-step breakdowns, event logs	Cannot explain discrepancies; multiple causal explanations possible
Fidelity map	Trust regions — where the simulator is reliable for which decision types	After outer-loop validation; accumulated evidence

Discrepancy Attribution

Every sim-to-real mismatch requires attribution before acting:

Root cause	Implication
Mechanism theory wrong	Strategy was wrong; update frontier, extract principle
Simulator missing causal factor	Structural model update needed
Coefficients mis-calibrated	Parametric re-fitting needed
Experiment mapping poor	Real run under different conditions than intended
Hidden deployment constraint	Blind-spot; update transfer bundle template for future

Never attribute discrepancy before investigation. The transfer bundle's diagnostic mismatch clause guides this.
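The attribution table can be encoded so that no update is recorded without an explicit root cause. This sketch uses the five root causes and implications from the table; the function name, the row layout, and the example values are hypothetical.

```python
from enum import Enum
from typing import Optional


class RootCause(Enum):
    MECHANISM_THEORY_WRONG = "mechanism theory wrong"
    SIMULATOR_MISSING_FACTOR = "simulator missing causal factor"
    COEFFICIENTS_MISCALIBRATED = "coefficients mis-calibrated"
    EXPERIMENT_MAPPING_POOR = "experiment mapping poor"
    HIDDEN_DEPLOYMENT_CONSTRAINT = "hidden deployment constraint"


# Implication of each root cause, paraphrased from the table above.
IMPLICATION = {
    RootCause.MECHANISM_THEORY_WRONG: "update frontier and extract principle",
    RootCause.SIMULATOR_MISSING_FACTOR: "structural model update",
    RootCause.COEFFICIENTS_MISCALIBRATED: "parametric re-fitting",
    RootCause.EXPERIMENT_MAPPING_POOR: "real run under different conditions than intended",
    RootCause.HIDDEN_DEPLOYMENT_CONSTRAINT: "update transfer bundle template",
}


def transfer_row(experiment: str, sim_prediction: float, real_outcome: float,
                 cause: Optional[RootCause]) -> dict:
    """Build one transfer-ledger row; refuses to proceed without attribution."""
    if cause is None:
        raise ValueError("attribute the discrepancy before recording an update")
    return {
        "experiment": experiment,
        "sim_prediction": sim_prediction,
        "real_outcome": real_outcome,
        "discrepancy": real_outcome - sim_prediction,
        "attribution": cause.value,
        "update_triggered": IMPLICATION[cause],
    }


# Made-up values: the simulator predicted a 30% reduction, the real run showed 10%.
row = transfer_row("exp-1", -0.30, -0.10, RootCause.COEFFICIENTS_MISCALIBRATED)
print(row["update_triggered"])  # parametric re-fitting
```

Raising on a missing cause is one way to enforce the rule that investigation precedes any simulator or frontier update.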

Three Knowledge Stores

All three loops write to shared state that feeds back into inner-loop frontier scoring:

Mechanism knowledge store: Confirmed and refuted hypothesis bundles, extracted design principles, regime-specific applicability boundaries.

Simulator knowledge store: Validated fidelity claims per submodel, known blind spots, calibration dataset versions, fidelity map.

Transfer ledger: One row per outer-loop experiment — simulation prediction, real-system outcome, discrepancy, attribution, update triggered.

What This Is and Isn't

Not a World Model

Recent agentic search systems use LLMs as world models — learned predictors of which actions will produce better artifacts, guiding the search frontier.

This methodology uses a different construct: a predictive causal model built through structured experimentation.

Property	LLM World Model	Predictive Causal Model
Representation	Learned, implicit, neural	Explicit, structured, auditable
Update signal	Fitness score	Prediction error + causal attribution
Primary output	Better search prioritization	Better causal understanding
Persistent artifact	Updated search tree	Principles catalog + fidelity map
Falsifiability	Not directly	Designed in (negative controls, ablations)

The key distinction: the predictive causal model explicitly encodes where a mechanism should fail — through negative controls, regime boundaries, and ablations. A world model does not naturally represent "where this mechanism should not help."

Not Bayesian Optimization

BO (Phase 4) is one component of this methodology. BO tunes numeric values given a confirmed mechanism. This methodology discovers which mechanisms are worth tuning in the first place — and separates mechanism design from parameter tuning so that comparisons are fair.

Not Reinforcement Learning

RL over the policy search problem is appealing, but it has two problems in complex system design:

The state space is discontinuous — policy logic, decision predicates, and scheduling disciplines are not parameterized by a common continuous space.
High fitness does not imply understanding. An RL agent can discover a high-performing policy that exploits evaluator artifacts or overfits to a specific workload, without the methodology having learned why it works or where it will fail.

An iteration that ends in a refuted H-main with a clear diagnostic clause is more valuable than a confirmed iteration whose mechanism is not understood.

Applying to a New Domain

The three-loop structure is domain-agnostic. To apply it:

Build a fast, deterministic evaluator: a simulator, benchmark, or test harness that accepts parameterized configuration and produces machine-parseable metrics. It must run fast enough for 100–200 evaluations per mechanism.
Write problem.md — baseline, workload, quantitative success criteria, constraints. Design the workload to prevent metric gaming (e.g., if strategies can use a proxy signal to shortcut the actual target, eliminate that proxy from the workload).
Start the ledger — one row per iteration, prediction accuracy column, never delete rows.
Run the inner loop — generate candidates with multi-judge review, decompose the winner into a hypothesis bundle with predictions before any code is written, review, implement, execute all arms, compare predictions to outcomes, extract principles from both confirmations and refutations.
Add the outer loop when sim-to-real transfer matters — define fidelity hypotheses, build transfer bundles for promoted candidates, select real experiments by ROI, attribute discrepancies before updating the simulator.
Stop when consecutive iterations produce null results — the principles catalog is the durable output.

Open Questions for Review

1. Frontier scoring calibration.
The frontier score combines expected value, uncertainty, novelty, cost, and principle penalty. How should these weights be set in practice? Should they be learned from the ledger, or specified manually per domain?

2. Fidelity hypothesis coverage.
The proposed fidelity arms are rank, direction, boundary, and gap. Are these the right axes? Which matters most for inner-loop pruning decisions?

3. Promotion policy.
Which simulated wins should be promoted to real-system validation? The document sketches criteria (strong sim win, high uncertainty, high upside, likely to expose simulator weakness) but a concrete promotion rule is not yet specified.

4. Bundle overhead at scale.
The hypothesis bundle structure adds significant evaluation cost. How should the frontier update when a bundle's H-main is refuted after passing the verifier? Should partial-failure bundles (H-main confirmed, ablations inconclusive) count as promoted candidates?

5. Adversarial workload generation.
Should a red-team agent generate workload conditions where a mechanism should fail, to tighten H-control-negative and H-robustness arms? Is this appropriate for all iterations, or only for outer-loop promotion candidates?

6. Simulator co-evolution governance.
Should structural simulator updates require a formal change proposal with its own review cycle? Or should they be fast engineering changes triggered by discrepancy analysis?

7. Four-level fidelity hierarchy.
The document proposes type-A transfer runs, type-B probes, and type-C coverage experiments. Should there be a fourth level, production A/B tests, sitting above type-A for deployment validation?
