Status: Draft for review — feedback welcome on structure, framing, and open questions.
This methodology was developed for LLM inference serving but is domain-agnostic — the three-loop structure applies to any complex system with multiple interacting policy layers and a simulation-based evaluation harness.
The Problem This Solves
In systems with multiple interacting policy layers — routing, scheduling, memory management, admission control — the optimal configuration cannot be derived analytically. Interactions produce non-obvious emergent behaviors: super-additive effects (two mechanisms together outperform their sum), signal cancellation (adding a seemingly relevant signal makes things worse), and regime-dependent dominance (the right answer changes with load).
Classical approaches have blind spots:
| Approach | What it misses |
| --- | --- |
| Grid/random search | Finds what works, not why it works or where it breaks |
| Bayesian optimization | Tunes parameters of a given mechanism; doesn't discover which mechanisms are worth tuning |
| Reinforcement learning | High fitness ≠ understanding; can overfit to the evaluator or workload |
| LLM world models (K-Search style) | Implicit, learned value estimates; no structured falsifiability; no "where should this fail?" |
The target is not a faster artifact. The target is a validated understanding of which mechanisms work, why they work, where they break, and how to transfer that understanding to real systems.
Foundation: Strategy Evolution (the Original Single-Loop)
Strategy Evolution is a structured iterative search methodology. The central idea: a strategy is a hypothesis bundle. Every candidate mechanism is formulated as a set of testable, falsifiable predictions — designed before any code is written. Prediction errors, not just fitness scores, are the primary signal for learning.
The Hypothesis Bundle
When a candidate strategy is selected, it is decomposed into a multi-arm hypothesis bundle:
| Arm | What it tests | Purpose |
| --- | --- | --- |
| H-main | Mechanism's predicted effect + causal explanation | Does the strategy work, and why? |
| H-ablation-{component} | Each component's individual contribution | Which parts matter? Are any redundant? |
| H-super-additivity | Whether compound effect exceeds sum of parts | Do components interact? |
| H-control-negative | Where the effect should vanish | Confirms mechanism specificity |
| H-robustness | Generalization across workloads, resources, scale | Where does the strategy break? |
Each arm includes a diagnostic clause — "if this fails, it indicates X" — that directs investigation when predictions don't match outcomes. Ablation hypotheses are designed before code is written, preventing confirmation bias: you predict each component's contribution before seeing whether the compound strategy works.
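One way to make the bundle concrete is a small record type per arm. The sketch below is illustrative only: the class names, field names, and the example strategy are assumptions, not part of the methodology's tooling. It shows how a missing diagnostic clause can be detected before any experiment runs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Arm:
    name: str                      # e.g. "H-main" or "H-ablation-admission"
    prediction: str                # falsifiable claim, written before any code
    diagnostic_clause: str         # "if this fails, it indicates X"
    predicted_effect: float        # predicted change in the target metric, in percent
    observed_effect: Optional[float] = None  # filled in only after the run


@dataclass
class HypothesisBundle:
    strategy: str
    arms: List[Arm] = field(default_factory=list)

    def missing_diagnostics(self) -> List[str]:
        """Names of arms that have no diagnostic clause yet."""
        return [a.name for a in self.arms if not a.diagnostic_clause.strip()]

    def is_precommitted(self) -> bool:
        """True when every arm has a prediction and a diagnostic clause, and no outcome is recorded."""
        return (
            bool(self.arms)
            and not self.missing_diagnostics()
            and all(a.prediction.strip() and a.observed_effect is None for a in self.arms)
        )


# Hypothetical bundle: the negative control still lacks its diagnostic clause.
bundle = HypothesisBundle(
    strategy="slo-gated-admission",
    arms=[
        Arm("H-main", "Critical TTFT P99 drops under overload", "Admission overhead dominates", -30.0),
        Arm("H-control-negative", "No effect at low load", "", 0.0),
    ],
)
print(bundle.missing_diagnostics())
print(bundle.is_precommitted())
```

A check like `is_precommitted()` could serve as a cheap guard at the design review, since it fails as soon as an outcome is recorded before the predictions are complete.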
Pre-commit ablation is the key discipline. If you can't articulate what removing a component should do, you don't understand the mechanism well enough to implement it.
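As a rough illustration of what the ablation and super-additivity arms compute, the sketch below assumes gains are expressed in percentage points over baseline and that each component can also be run alone. The function names and all numbers are hypothetical.

```python
from typing import Dict


def component_contributions(full_gain: float, ablated_gains: Dict[str, float]) -> Dict[str, float]:
    """Contribution of a component = gain of the full strategy minus gain with that component removed."""
    return {name: full_gain - gain for name, gain in ablated_gains.items()}


def is_super_additive(compound_gain: float, solo_gains: Dict[str, float], margin: float = 0.0) -> bool:
    """True when the compound gain exceeds the sum of the gains of each component run alone."""
    return compound_gain > sum(solo_gains.values()) + margin


# Hypothetical measurements, in percentage points of improvement over baseline.
full = 40.0
ablated = {"admission": 10.0, "priority": 35.0}   # gain left after removing each component
solo = {"admission": 20.0, "priority": 5.0}       # gain of each component on its own

print(component_contributions(full, ablated))
print(is_super_additive(full, solo))
```

In this made-up case, removing admission costs far more than removing priority, and the compound gain exceeds the sum of the solo gains, which is the pattern the H-super-additivity arm is meant to detect.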
The Five-Phase Loop
Phase 1: Problem Framing
  → Write problem.md: baseline, workload, quantitative success criteria, constraints,
    prior knowledge inventory. Design workload to prevent metric gaming.

Phase 2: Hypothesis Bundle Design
  → Generate 2–3 candidate strategies, each self-critiqued and reviewed by multiple
    independent judges (human or LLM). Select winner by consensus.
  → Decompose winner into hypothesis bundle (H-main, ablations, controls, robustness).
  → Convergence-gated Design Review. Human approval gate: hard stop before implementation.

Phase 3: Implement and Verify
  → Implement strategy code + experiment code for all arms.
  → Code Review before running experiments.
  → Execute all arms across 3+ seeds. Compare predictions to outcomes arm-by-arm.
  → Document FINDINGS.md. 10-perspective FINDINGS Review.
  → Record in ledger (one row per iteration, with prediction accuracy column).

Phase 4: Bayesian Parameter Optimization
  → For confirmed mechanisms only: Gaussian process over parameter space.
  → Separates mechanism design (human creativity) from parameter tuning (machine search).
  → Every confirmed strategy gets optimized, so comparisons are fair.

Phase 5: Principle Extraction and Iteration
  → Distill numbered principles from confirmed AND refuted predictions.
  → Refuted predictions are the most valuable: discrepancy reveals something not understood.
  → Principles become hard constraints on subsequent iterations.
  → Stop when consecutive iterations produce null/marginal results.
The ledger is the single source of truth: one row per iteration, never deleted. Failed approaches are as valuable as successes — the ledger makes the full exploration path auditable and prevents re-exploring failed mechanisms.
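A minimal sketch of such a ledger follows, assuming prediction accuracy is scored as the fraction of arms whose observed effect lands within an absolute tolerance of the prediction. That scoring rule, the class names, and the example values are assumptions for illustration; the methodology does not prescribe them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def prediction_accuracy(predicted: Dict[str, float], observed: Dict[str, float], tol: float) -> float:
    """Fraction of arms whose observed effect lies within `tol` of the prediction."""
    hits = sum(1 for arm, p in predicted.items() if abs(observed[arm] - p) <= tol)
    return hits / len(predicted)


@dataclass(frozen=True)
class LedgerRow:
    iteration: int
    strategy: str
    status: str                  # "confirmed", "refuted", or "inconclusive"
    prediction_accuracy: float


class Ledger:
    """Append-only: rows can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._rows: List[LedgerRow] = []

    def append(self, row: LedgerRow) -> None:
        if self._rows and row.iteration <= self._rows[-1].iteration:
            raise ValueError("iterations must be strictly increasing")
        self._rows.append(row)

    @property
    def rows(self) -> Tuple[LedgerRow, ...]:
        return tuple(self._rows)  # read-only view

    def already_tried(self, strategy: str) -> bool:
        return any(r.strategy == strategy for r in self._rows)


# Hypothetical iteration: H-main is close, the negative control is badly off.
predicted = {"H-main": -30.0, "H-control-negative": 0.0}
observed = {"H-main": -28.0, "H-control-negative": -12.0}
accuracy = prediction_accuracy(predicted, observed, tol=5.0)

ledger = Ledger()
ledger.append(LedgerRow(1, "priority-only", "refuted", accuracy))
print(accuracy, ledger.already_tried("priority-only"))
```

Exposing rows only as a tuple and offering no delete method is one simple way to keep the "never deleted" rule from depending on reviewer discipline.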
Example results from two parallel tracks on LLM inference serving:
| Track | Iterations | Best result | Key discovery |
| --- | --- | --- | --- |
| Scheduling | 11 | −73.7% critical TTFT P99 | Priority is zero-sum; admission is non-zero-sum |
| Routing | 22 | −341x TTFT P99 (P/D disaggregation) | KV-utilization scorer counterproductive under pressure |
Both tracks independently converged on SLO-gated admission control as the breakthrough "third lever."
Three Limitations of the Original Design
The original Strategy Evolution loop has three structural gaps:
1. No frontier management. The original loop picks one candidate per iteration and commits to it. This is greedy: if two mechanisms are both promising but one gets selected first, the other may never be explored. Re-exploration happens accidentally rather than systematically.
2. No sim-to-real discipline. The original loop stays entirely in simulation. Sim-to-real transfer is informal and opportunistic: there is no principled answer to "when should we run a real experiment, and what kind?" Real GPU-hours are scarce; their use should be explicit and cost-aware.
3. Simulator treated as fixed. The simulator is used as an evaluation harness but its fidelity is never systematically improved. Over time, the simulator's blind spots become the methodology's blind spots.
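Agentic Strategy Evolution: Three-Loop Architecture
The extended methodology addresses these limitations by adding two outer loops around the original inner loop.
Inner Loop: Hypothesis Frontier Search
The inner loop is the original five-phase Strategy Evolution cycle, extended with a frontier manager and a verifier gate.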
The Hypothesis Frontier
Instead of generating one candidate per iteration and immediately committing, maintain a scored frontier of partially-specified hypothesis bundles:
frontier_score(bundle) =
    expected_value         # predicted performance gain if confirmed
  + uncertainty_bonus      # prefer bundles with high epistemic value
  + novelty_bonus          # prefer mechanistically unexplored regions
  - eval_cost              # simulator budget required to evaluate
  - principle_violation    # penalty for contradicting known principles
The highest-scoring bundle is selected for the current iteration. The frontier is expanded by generating new candidates, scored, and pruned when bundles are dominated or refuted.
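The scoring and selection step can be sketched in a few lines. The sketch assumes all five terms have already been normalized to a common scale, which the text leaves open; the entry names and values are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrontierEntry:
    name: str
    expected_value: float        # predicted performance gain if confirmed
    uncertainty_bonus: float     # epistemic value of evaluating this bundle
    novelty_bonus: float         # reward for mechanistically unexplored regions
    eval_cost: float             # simulator budget required
    principle_violation: float   # penalty for contradicting known principles


def frontier_score(b: FrontierEntry) -> float:
    return (
        b.expected_value
        + b.uncertainty_bonus
        + b.novelty_bonus
        - b.eval_cost
        - b.principle_violation
    )


def select_next(frontier: List[FrontierEntry]) -> FrontierEntry:
    """Pick the highest-scoring bundle for the current iteration."""
    return max(frontier, key=frontier_score)


# Hypothetical frontier with terms on an assumed common scale.
frontier = [
    FrontierEntry("tuned-priority", 0.5, 0.25, 0.0, 0.25, 0.0),
    FrontierEntry("prefix-affinity", 0.25, 0.5, 0.5, 0.25, 0.0),
    FrontierEntry("kv-util-scorer", 1.0, 0.0, 0.0, 0.25, 1.0),
]
print(select_next(frontier).name)
```

Note that the bundle with the largest expected value is not selected here: its principle-violation penalty outweighs the gain, while the less certain but novel bundle scores highest.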
Why this matters: Sequential commitment causes re-exploration of similar mechanisms when early candidates succeed. A frontier explicitly tracks what has been tried and prevents the search from collapsing to a narrow mechanism family prematurely — a failure mode called super-additivity blindness, where a family of good mechanisms eclipses a different family that is also good but unexplored.
Diversity Preservation
Maintain a diversity archive indexed by:
Mechanistic family — e.g., admission-heavy, queue-depth-dominant, prefix-affinity
Regime behavior — e.g., works only under memory pressure, works at saturation, regime-independent
Performance frontier — Pareto-optimal bundles across competing metric tradeoffs
When selecting the next bundle, prefer nodes that are both high-scoring and mechanistically distinct from recently evaluated bundles.
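One simple way to express this preference is to discount a candidate's score by how often its mechanistic family appears among recently evaluated bundles. The linear penalty below is an assumption chosen for brevity, not a rule from the methodology, and the candidates are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[str, str, float]  # (bundle name, mechanistic family, frontier score)


def select_diverse(candidates: List[Candidate], recent_families: List[str], penalty: float = 0.5) -> Candidate:
    """Prefer high scores, discounted for each recent evaluation in the same family."""
    def adjusted(c: Candidate) -> float:
        _, family, score = c
        return score - penalty * recent_families.count(family)
    return max(candidates, key=adjusted)


# Hypothetical state: the last two iterations both explored admission-heavy bundles.
candidates = [
    ("admit-v2", "admission-heavy", 1.0),
    ("prefix-v1", "prefix-affinity", 0.75),
]
recent = ["admission-heavy", "admission-heavy"]
print(select_diverse(candidates, recent)[0])
```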
The Verifier Gate
Before committing to full simulation, run lightweight verifiers on each candidate bundle:
| Verifier | What it checks |
| --- | --- |
| Bundle consistency | All arms (H-main, ablations, controls) are mutually coherent |
| Principle violations | Bundle doesn't contradict established principles without justification |
| Control validity | Negative-control conditions are achievable |
| Plausibility | Predicted effect sizes are within physically plausible range |
| Analyzer sanity | Analysis script can be written before running experiments |
Bundles that fail a verifier are repaired or pruned before expensive simulation runs.
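A gate of this kind can be modeled as a set of cheap predicate functions run before any simulation. The sketch covers three of the verifiers; the bundle layout, the required arm set, and the 100% plausibility bound (a latency reduction cannot exceed the baseline latency) are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

Bundle = Dict[str, Any]  # hypothetical bundle layout, see the example below


def has_required_arms(b: Bundle) -> bool:
    return {"H-main", "H-control-negative"} <= set(b["arms"])


def respects_principles(b: Bundle) -> bool:
    # A contradiction is acceptable only when it comes with a justification.
    return not b["violated_principles"] or bool(b["justification"])


def effect_is_plausible(b: Bundle) -> bool:
    # Assumed bound: a percentage reduction in latency cannot exceed 100%.
    return abs(b["predicted_effect_pct"]) <= 100.0


VERIFIERS: Dict[str, Callable[[Bundle], bool]] = {
    "bundle_consistency": has_required_arms,
    "principle_violations": respects_principles,
    "plausibility": effect_is_plausible,
}


def run_gate(bundle: Bundle) -> List[str]:
    """Return the names of the verifiers the bundle fails (empty list = pass)."""
    return [name for name, check in VERIFIERS.items() if not check(bundle)]


candidate = {
    "arms": ["H-main"],
    "violated_principles": [],
    "justification": "",
    "predicted_effect_pct": -150.0,
}
print(run_gate(candidate))
```

Returning the list of failed verifier names, instead of a single boolean, keeps the repair-or-prune decision informed about what needs fixing.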
Outer Loop: VoI-Governed Real-World Experiment Selection
Real experiments (GPU-hours) are scarce. The outer loop makes their use principled by selecting the experiment with the highest expected research value per unit cost.
The ROI Formula
ROI(e) = E[VoI(e)] / Cost(e)

VoI(e) = w_m · ΔU_mechanism      # reduces uncertainty about whether a promoted candidate works
       + w_s · ΔU_simulator      # helps calibrate or repair the simulator
       + w_r · ΔU_ranking        # helps rank frontier candidates correctly
       + w_b · ΔU_boundary       # identifies where a mechanism starts helping or failing
       + w_d · ΔU_deployment     # reduces uncertainty for a likely production choice

Cost(e) = gpu_hours
        + λ · eng_hours          # engineering setup and analysis time
        + μ · calendar_delay     # queueing delay for scarce cluster access
This reframes real experimentation from "validate the current best candidate" to "choose the most informative experiment available."
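The formula translates directly into code. In the sketch below the uncertainty reductions are treated as already-estimated expected values, and every weight, rate, and experiment is hypothetical; how to estimate the ΔU terms in practice is outside its scope.

```python
from typing import Dict


def value_of_information(delta_u: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of expected uncertainty reductions; missing terms count as zero."""
    return sum(w * delta_u.get(term, 0.0) for term, w in weights.items())


def experiment_cost(gpu_hours: float, eng_hours: float, calendar_delay: float, lam: float, mu: float) -> float:
    return gpu_hours + lam * eng_hours + mu * calendar_delay


def roi(delta_u: Dict[str, float], weights: Dict[str, float], gpu_hours: float,
        eng_hours: float, calendar_delay: float, lam: float, mu: float) -> float:
    return value_of_information(delta_u, weights) / experiment_cost(
        gpu_hours, eng_hours, calendar_delay, lam, mu
    )


# Hypothetical weights and exchange rates (lam: GPU-hours per engineer-hour, mu: per day of delay).
weights = {"mechanism": 1.0, "simulator": 1.0, "ranking": 1.0, "boundary": 1.0, "deployment": 1.0}
lam, mu = 4.0, 1.0

transfer_run = roi({"mechanism": 2.0}, weights, gpu_hours=8.0, eng_hours=2.0, calendar_delay=0.0, lam=lam, mu=mu)
calibration_probe = roi({"simulator": 3.0, "ranking": 1.0}, weights, gpu_hours=2.0, eng_hours=1.0, calendar_delay=2.0, lam=lam, mu=mu)
print(transfer_run, calibration_probe)
```

With these made-up inputs the calibration probe scores higher, because it reduces two kinds of uncertainty at half the total cost of the transfer run.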
Three Experiment Types
| Type | What it tests | When to prefer |
| --- | --- | --- |
| A: Transfer run | Does the simulated win transfer end-to-end to reality? | Uncertainty is about whether a promoted mechanism survives real-system effects |
| B: Calibration probe | Is a specific simulator submodel accurate? | Uncertainty is concentrated in one shared submodel affecting many candidates |
| C: Coverage experiment | Does the simulator behave correctly in a new regime? | A workload family or concurrency band is uncharacterized |
Why type-B probes often have the best amortized value: A single type-A run informs one promoted policy. A type-B probe that resolves one shared simulator uncertainty (e.g., decode-phase overhead model) can improve rank-order prediction for dozens of future candidates. When simulator uncertainty is globally shared across many frontier candidates, probes have better amortized ROI.
Operating rule: Prefer type-B when uncertainty is in a shared submodel; type-A when uncertainty is about whether a mechanism survives end-to-end; type-C when the primary gap is regime coverage.
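The operating rule amounts to a lookup from the dominant source of uncertainty to an experiment type. The labels used as keys below are hypothetical names for the three situations the rule describes.

```python
from enum import Enum


class ExperimentType(Enum):
    TRANSFER = "A"
    CALIBRATION = "B"
    COVERAGE = "C"


# Assumed labels for the dominant uncertainty; the rule itself is from the text above.
RULE = {
    "shared_submodel": ExperimentType.CALIBRATION,
    "end_to_end_survival": ExperimentType.TRANSFER,
    "regime_coverage": ExperimentType.COVERAGE,
}


def choose_experiment_type(dominant_uncertainty: str) -> ExperimentType:
    try:
        return RULE[dominant_uncertainty]
    except KeyError:
        raise ValueError(f"unknown uncertainty source: {dominant_uncertainty!r}") from None


print(choose_experiment_type("shared_submodel").value)
```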
The Transfer Bundle
For every promoted candidate, produce a transfer bundle — the outer-loop analogue of a hypothesis bundle:
mechanism_claim: "SLO-gated admission reduces critical latency by preventing low-value work from saturating service capacity."
transfer_claim: "Improvement direction transfers; absolute magnitude may shrink under real backpressure."
fidelity_expectation:
  rank_order: high    # simulator should correctly rank this above baseline
  direction: high     # improvement direction should hold
  magnitude: medium   # absolute latency values may differ
  boundary: medium    # load threshold where gains appear may shift
simulator_blind_spots:
  - network contention between service components
  - burst amplification from chunked processing
  - cache fragmentation under concurrent sessions
diagnostic_mismatch_clause: "If simulator shows gain but real system does not, inspect admission-delay overhead and per-token burst amplification."
data_capture_plan:
  required_observables:
    - per-step latency breakdown
    - queue depth evolution over time
    - admission accept/reject rates
  calibration_targets:
    - service time coefficient
    - admission overhead constant
Dual-Purpose Real Runs
Every outer-loop run should simultaneously:
Validate the promoted mechanism
Test simulator fidelity hypotheses
Collect data for coefficient fitting and workload modeling
This avoids paying for expensive experiment time twice.
Fidelity Hypothesis Classes
In addition to mechanism hypotheses, define fidelity hypotheses that explicitly test what the simulator gets right:
| Arm | What it tests |
| --- | --- |
| H-fidelity-rank | Simulator correctly ranks top-k candidates |
| H-fidelity-direction | Simulator preserves improvement direction over baseline |
| H-fidelity-boundary | Simulator correctly predicts the regime where gains appear or vanish |
| H-fidelity-gap | Simulator overestimates or underestimates magnitude by no more than X% |
Every serious outer-loop campaign should test both a mechanism hypothesis and the corresponding fidelity hypothesis simultaneously.
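Three of the fidelity arms reduce to short checks once simulated and real results sit side by side. The sketch assumes higher scores are better for ranking, that effects are signed percentage changes versus baseline, and that the gap is measured relative to the real outcome. None of these conventions is fixed by the text, and the data is invented.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Candidate names ordered best-first (assumes higher score = better)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fidelity_rank(sim: Dict[str, float], real: Dict[str, float], k: int) -> bool:
    """H-fidelity-rank: the simulator's ordered top-k matches the real system's."""
    return top_k(sim, k) == top_k(real, k)


def fidelity_direction(sim_delta: float, real_delta: float) -> bool:
    """H-fidelity-direction: both deltas versus baseline have the same sign."""
    return sim_delta * real_delta > 0


def fidelity_gap(sim_delta: float, real_delta: float, max_rel_gap: float) -> bool:
    """H-fidelity-gap: magnitude error is within max_rel_gap of the real effect."""
    return abs(sim_delta - real_delta) <= max_rel_gap * abs(real_delta)


# Hypothetical results for three candidates and one promoted mechanism.
sim_scores = {"a": 3.0, "b": 2.0, "c": 1.0}
real_scores = {"a": 2.9, "b": 2.8, "c": 1.0}
print(fidelity_rank(sim_scores, real_scores, k=2))
print(fidelity_direction(-40.0, -25.0))
print(fidelity_gap(-40.0, -25.0, max_rel_gap=0.5))
```

In this example the simulator gets the ranking and the direction right but overestimates the magnitude by more than the allowed 50%, so only the gap arm would be refuted.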
Simulator Evolution Loop
The simulator is not a fixed evaluation harness — it is a scientific instrument that co-evolves with the mechanisms it is used to study.
The Key Reframing
Stop asking: "Is the simulator accurate?"
Start asking: "Where is the simulator reliable enough for which decisions?"
A simulator does not need to be perfect. It needs to be trustworthy for specific decisions:
Inner-loop pruning requires rank-order fidelity
Publication claims require directional fidelity
Deployment recommendations may require regime-boundary fidelity
Absolute SLO targets require magnitude fidelity
The fidelity map records these distinctions explicitly.
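A fidelity map can be as simple as a lookup from each submodel to the fidelity kinds that have been validated for it. The sketch below pairs that with the decision-to-fidelity requirements listed above; the key names and submodels are hypothetical, and the deployment entry reflects the text's hedged "may require".

```python
from typing import Dict, Set

# Fidelity kind each decision type needs, following the list above.
REQUIRED_FIDELITY: Dict[str, str] = {
    "inner_loop_pruning": "rank_order",
    "publication_claim": "direction",
    "deployment_recommendation": "boundary",   # "may require" in the text
    "absolute_slo_target": "magnitude",
}

# Hypothetical fidelity map: validated fidelity kinds per simulator submodel.
fidelity_map: Dict[str, Set[str]] = {
    "admission_queue": {"rank_order", "direction"},
    "decode_overhead": {"rank_order"},
}


def can_trust(decision: str, submodel: str, fmap: Dict[str, Set[str]]) -> bool:
    """True if the submodel has a validated claim of the kind this decision needs."""
    return REQUIRED_FIDELITY[decision] in fmap.get(submodel, set())


print(can_trust("inner_loop_pruning", "decode_overhead", fidelity_map))
print(can_trust("absolute_slo_target", "admission_queue", fidelity_map))
```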
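Five Update Types
Discrepancy Attribution
Every sim-to-real mismatch requires attribution before acting:
Never attribute discrepancy before investigation. The transfer bundle's diagnostic mismatch clause guides this.
Three Knowledge Stores
All three loops write to shared state that feeds back into inner-loop frontier scoring:
Mechanism knowledge store: Confirmed and refuted hypothesis bundles, extracted design principles, regime-specific applicability boundaries.
Simulator knowledge store: Validated fidelity claims per submodel, known blind spots, calibration dataset versions, fidelity map.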
Transfer ledger: One row per outer-loop experiment — simulation prediction, real-system outcome, discrepancy, attribution, update triggered.
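The transfer ledger row maps naturally onto a small record. In this sketch the field names mirror the columns named above, the discrepancy is assumed to be the signed difference between real and simulated effect, and the experiment shown is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransferRow:
    experiment: str
    sim_prediction: float                    # effect predicted by the simulator, in percent
    real_outcome: float                      # effect measured on the real system, in percent
    attribution: Optional[str] = None        # left empty until the mismatch is investigated
    update_triggered: Optional[str] = None   # simulator or principle update, if any

    @property
    def discrepancy(self) -> float:
        """Signed gap: real outcome minus simulated prediction (an assumed convention)."""
        return self.real_outcome - self.sim_prediction


row = TransferRow("slo-gated-admission-transfer", sim_prediction=-40.0, real_outcome=-25.0)
print(row.discrepancy, row.attribution)
```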
What This Is and Isn't
Not a World Model
Recent agentic search systems use LLMs as world models — learned predictors of which actions will produce better artifacts, guiding the search frontier.
This methodology uses a different construct: a predictive causal model built through structured experimentation.
| Property | LLM World Model | Predictive Causal Model |
| --- | --- | --- |
| Representation | Learned, implicit, neural | Explicit, structured, auditable |
| Update signal | Fitness score | Prediction error + causal attribution |
| Primary output | Better search prioritization | Better causal understanding |
| Persistent artifact | Updated search tree | Principles catalog + fidelity map |
| Falsifiability | Not directly | Designed in (negative controls, ablations) |
The key distinction: the predictive causal model explicitly encodes where a mechanism should fail — through negative controls, regime boundaries, and ablations. A world model does not naturally represent "where this mechanism should not help."
Not Bayesian Optimization
BO (Phase 4) is one component of this methodology. BO tunes numeric values given a confirmed mechanism. This methodology discovers which mechanisms are worth tuning in the first place — and separates mechanism design from parameter tuning so that comparisons are fair.
Not Reinforcement Learning
RL over the policy search problem is appealing, but it runs into two problems in complex system design:
1. The state space is discontinuous: policy logic, decision predicates, and scheduling disciplines are not parameterized by a common continuous space.
2. High fitness does not imply understanding. An RL agent can discover a high-performing policy that exploits evaluator artifacts or overfits to a specific workload, without the methodology having learned why it works or where it will fail.
An iteration that ends in a refuted H-main with a clear diagnostic clause is more valuable than a confirmed iteration whose mechanism is not understood.
Applying to a New Domain
The three-loop structure is domain-agnostic. To apply it:
1. Build a fast, deterministic evaluator: a simulator, benchmark, or test harness that accepts a parameterized configuration and produces machine-parseable metrics. It must run fast enough for 100–200 evaluations per mechanism.
2. Write problem.md: baseline, workload, quantitative success criteria, constraints. Design the workload to prevent metric gaming (e.g., if strategies can use a proxy signal to shortcut the actual target, eliminate that proxy from the workload).
3. Start the ledger: one row per iteration, a prediction accuracy column, and no deleted rows.
4. Run the inner loop: generate candidates with multi-judge review, decompose the winner into a hypothesis bundle with predictions before any code is written, review, implement, execute all arms, compare predictions to outcomes, and extract principles from both confirmations and refutations.
5. Add the outer loop when sim-to-real transfer matters: define fidelity hypotheses, build transfer bundles for promoted candidates, select real experiments by ROI, and attribute discrepancies before updating the simulator.
6. Stop when consecutive iterations produce null results. The principles catalog is the durable output.
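The stopping rule can be written as a short check over the per-iteration gains recorded in the ledger. The patience of two iterations and the 1% threshold are placeholder values; what counts as "null or marginal" is a domain decision.

```python
from typing import List


def should_stop(gains: List[float], patience: int = 2, min_gain: float = 0.01) -> bool:
    """Stop when the last `patience` iterations each improved by less than `min_gain`."""
    if len(gains) < patience:
        return False
    return all(abs(g) < min_gain for g in gains[-patience:])


# Hypothetical fractional gains per iteration, oldest first.
print(should_stop([0.12, 0.05, 0.004, 0.002]))
print(should_stop([0.12, 0.004, 0.05]))
```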
Open Questions for Review
1. Frontier scoring calibration.
The frontier score combines expected value, uncertainty, novelty, cost, and principle penalty. How should these weights be set in practice? Should they be learned from the ledger, or specified manually per domain?
2. Fidelity hypothesis coverage.
The proposed fidelity arms are rank, direction, boundary, and gap. Are these the right axes? Which matters most for inner-loop pruning decisions?
3. Promotion policy.
Which simulated wins should be promoted to real-system validation? The document sketches criteria (strong sim win, high uncertainty, high upside, likely to expose simulator weakness) but a concrete promotion rule is not yet specified.
4. Bundle overhead at scale.
The hypothesis bundle structure adds significant evaluation cost. How should the frontier update when a bundle's H-main is refuted after passing the verifier? Should partial-failure bundles (H-main confirmed, ablations inconclusive) count as promoted candidates?
5. Adversarial workload generation.
Should a red-team agent generate workload conditions where a mechanism should fail, to tighten the H-control-negative and H-robustness arms? Is this appropriate for every iteration, or only for outer-loop promotion candidates?
6. Simulator co-evolution governance.
Should structural simulator updates require a formal change proposal with its own review cycle? Or should they be fast engineering changes triggered by discrepancy analysis?
7. Four-level fidelity hierarchy.
The document proposes type-A transfer runs, type-B probes, and type-C coverage experiments. Should there be a fourth level, production A/B tests, sitting above type-A for deployment validation?