Wasted effort in agentic strategy evolution frameworks - proposed definition + first instantiations #100

vishakha-ramani · 2026-05-18T22:32:04Z


The first-order question is:

Do frameworks like openevolve, GEPA, and Nous actually waste effort, and if so, how much and of what kind?

I want to first define what 'waste' even means rigorously, then show the definition is useful by measuring.
There's no single obvious definition. A few possible anchors:

Counterfactual. Waste = work whose absence wouldn't have changed the final outcome. Conceptually clean, but expensive — you need replays or ablations.
Footprint. Waste = work whose product doesn't appear in the final artifact (program never made the archive, principle never made the store). Cheap from logs, but conflates "didn't survive" with "didn't help."
No information about the target. Waste = work that didn't reduce the system's uncertainty about what it's trying to learn. Cleanest, but requires knowing what the framework's actually trying to learn, which is the right thing to ask anyway.

I'm working with the last one:

Waste = effort that produces no marginal reduction in uncertainty about the search objective, immediate or delayed.

Three things this forces us to specify, per framework:

What is the framework actually trying to learn?
How do we represent the system's belief about that target?
What counts as a "shift" in that belief?

There's also a fourth piece, shared across frameworks: a delayed-credit rule. A call that didn't shift belief immediately but enabled a later call that did gets a discounted share of the later shift. Default discount is 70%. Without this rule, any framework that stages information across calls gets falsely flagged as wasteful.
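A minimal sketch of how the delayed-credit rule could be applied, under stated assumptions. The names `credited_shift`, `wasted_calls`, `immediate`, and `enabled_by` are hypothetical, and the "who enabled whom" links are taken as given rather than inferred. It also assumes that "70%" means each enabling call receives 0.7 of the later shift; if the intended reading is that 70% is discounted away, pass `discount=0.3`. Credit is propagated one hop only and the later call keeps its full shift.

```python
from collections import defaultdict


def credited_shift(immediate, enabled_by, discount=0.7):
    """Per-call credit: own shift plus a discounted share of enabled shifts.

    immediate:  {call_id: belief shift observed right after that call}
    enabled_by: {later_call_id: [earlier call_ids that enabled it]}
    discount:   fraction of the later shift passed back (assumed reading of "70%")
    """
    credit = defaultdict(float)
    for call_id, shift in immediate.items():
        credit[call_id] += shift
    # One-hop propagation only: an enabler is credited for direct successors.
    for later, enablers in enabled_by.items():
        for earlier in enablers:
            credit[earlier] += discount * immediate.get(later, 0.0)
    return dict(credit)


def wasted_calls(immediate, enabled_by, discount=0.7, eps=1e-9):
    """Calls with no immediate and no delayed credit."""
    credit = credited_shift(immediate, enabled_by, discount)
    return sorted(c for c in immediate if credit.get(c, 0.0) <= eps)


# Illustrative values only, not measured data.
immediate = {"design_1": 0.0, "execute_1": 0.5, "design_2": 0.0}
enabled_by = {"execute_1": ["design_1"]}
print(wasted_calls(immediate, enabled_by))
```

Here `design_1` escapes the waste flag because it staged information for `execute_1`, while `design_2` has neither immediate nor delayed credit.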

The targets are different — and that matters

| Framework | Target | Belief representation | What counts as a shift |
| --- | --- | --- | --- |
| openevolve | The winning candidate | A probability distribution over candidates, estimated from logged evaluator scores via a small surrogate model | Probability mass moves: consolidating onto the eventual winner, ruling out a candidate, surfacing a new one |
| Nous | The final principle store | The current `principles.json` | Structural changes only: insert / prune / direction-of-mechanism revision. Confidence sharpening doesn't count. |
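To make the Nous row concrete, here is a hedged sketch of a structural diff between two snapshots of the principle store. The schema (`{principle_id: {"direction": ..., "confidence": ...}}`) and the function name `structural_changes` are assumptions for illustration; the real `principles.json` layout may differ.

```python
def structural_changes(before, after):
    """Diff two principle-store snapshots, ignoring confidence-only edits.

    Both arguments are dicts keyed by a (hypothetical) principle id, each
    value holding at least a "direction" field.
    """
    changes = []
    for pid in after.keys() - before.keys():
        changes.append(("insert", pid))
    for pid in before.keys() - after.keys():
        changes.append(("prune", pid))
    for pid in before.keys() & after.keys():
        # A change of mechanism direction is structural; confidence is not.
        if before[pid]["direction"] != after[pid]["direction"]:
            changes.append(("direction_revision", pid))
    return sorted(changes)


# Illustrative snapshots only.
before = {
    "p1": {"direction": "increases", "confidence": 0.6},
    "p2": {"direction": "decreases", "confidence": 0.5},
}
after = {
    "p1": {"direction": "increases", "confidence": 0.9},  # sharpening only
    "p3": {"direction": "increases", "confidence": 0.4},  # new principle
}
print(structural_changes(before, after))
```

Under this sketch, a call whose before/after snapshots yield an empty list produced no shift, even if confidences moved.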

We measure waste in LLM tokens, input plus output, with dollar cost as a direct derivative. Tokens are the dominant cost, are already logged by every framework's LLM client, and stay reproducible across machines.
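As a rough sketch of the accounting, assuming each logged call already carries token counts and a wasted/not-wasted label from the metric: the `Call` record, `waste_summary`, and the per-million-token prices are all hypothetical placeholders, not real pricing or a real log schema.

```python
from dataclasses import dataclass


@dataclass
class Call:
    call_id: str
    input_tokens: int
    output_tokens: int
    wasted: bool  # label produced by the waste metric, assumed given


def waste_summary(calls, usd_per_m_input, usd_per_m_output):
    """Aggregate wasted tokens (input plus output) and the derived dollar cost."""
    total = sum(c.input_tokens + c.output_tokens for c in calls)
    wasted = [c for c in calls if c.wasted]
    wasted_tokens = sum(c.input_tokens + c.output_tokens for c in wasted)
    wasted_usd = sum(
        c.input_tokens * usd_per_m_input + c.output_tokens * usd_per_m_output
        for c in wasted
    ) / 1_000_000
    return {
        "total_tokens": total,
        "wasted_tokens": wasted_tokens,
        "waste_fraction": wasted_tokens / total if total else 0.0,
        "wasted_usd": wasted_usd,
    }


# Illustrative numbers and placeholder prices.
calls = [
    Call("design_5", 1000, 200, wasted=True),
    Call("execute_5", 3000, 800, wasted=False),
]
print(waste_summary(calls, usd_per_m_input=1.0, usd_per_m_output=2.0))
```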

Not all waste is equal. Three tiers.

A finding I want to surface up front. Not every category of waste under our definition is equally interesting to measure. We propose splitting reported waste into three tiers, because conflating them obscures both engineering opportunities and structural design choices.

Tier 1, hygiene waste. Literal duplicates that a deterministic check would catch. For example, DESIGN proposes an H-main whose text is essentially identical to an existing principle. A hash-based pre-check would catch it in roughly ten lines of code.
Tier 2, semantic redundancy. Hypotheses that are semantically equivalent to existing principles but phrased differently, produce only confidence updates when run, and consume real tokens in the process. For example, iteration 5's H-main targets the same mechanism as iteration 2's principle but rephrased. EXECUTE_ANALYZE re-confirms, and principle_updates.json records only a confidence-sharpening update.

Tier 2 is reducible by an LLM-judge or semantic-similarity dedup check, but with a real false-positive risk. Some duplicate-looking hypotheses actually expose new regimes and produce structural changes such as prunes or direction revisions. The proposal can characterize the realistic ceiling on semantic dedup directly from data. Among semantic-duplicate H-mains, what fraction produce only confidence updates (catch them) vs. structural changes (don't)?
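The ceiling question above reduces to a simple ratio once each H-main is labeled. A minimal sketch, assuming each record is a pair of (flagged as semantic duplicate, outcome), where the outcome is either `"confidence_only"` or `"structural"`; the labels, the function name `dedup_ceiling`, and the sample records are hypothetical.

```python
def dedup_ceiling(records):
    """Among semantic-duplicate H-mains, split outcomes into safe vs. risky.

    records: iterable of (is_semantic_duplicate, outcome) pairs, where
             outcome is "confidence_only" or "structural".
    Returns None when no duplicates were flagged.
    """
    outcomes = [outcome for is_dup, outcome in records if is_dup]
    if not outcomes:
        return None
    safe = sum(1 for o in outcomes if o == "confidence_only")
    return {
        "n_duplicates": len(outcomes),
        # Share a dedup check could skip without losing a structural change.
        "safe_to_skip": safe / len(outcomes),
        # Share where skipping would have suppressed a structural change.
        "false_positive": (len(outcomes) - safe) / len(outcomes),
    }


# Made-up labels for illustration, not campaign data.
records = [
    (True, "confidence_only"),
    (True, "confidence_only"),
    (True, "structural"),
    (True, "confidence_only"),
    (False, "structural"),
]
print(dedup_ceiling(records))
```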

Tier 3, structural or methodology-mandated. Arms required by methodology even when unlikely to produce structural changes. For example, after H-main confirms the bottleneck mechanism, the bundle sizing rules require an H-robustness arm at three workload sizes. The mechanism holds at all three. principle_updates.json records only a confidence-sharpening update.

This arm is structurally wasted under the metric, but it isn't a Planner bug. The arm was required for falsifiability and rigor. The metric is surfacing a real tension between methodology and information efficiency that Nous accepts deliberately. Certain arms are non-optional even when unlikely to produce structural changes. We don't judge whether the trade-off is right. We just quantify what it costs. This is likely the largest Tier 3 line item on the Nous side.
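For the Tier 1 case, the hash-based pre-check could look roughly like this. It is a sketch under assumptions: "essentially identical" is interpreted here as identical after collapsing whitespace and lowercasing, and the function names are hypothetical rather than part of Nous.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and case so trivial formatting differences vanish."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def is_literal_duplicate(hypothesis, principle_texts):
    """True if the hypothesis text matches an existing principle after normalization."""
    seen = {fingerprint(p) for p in principle_texts}
    return fingerprint(hypothesis) in seen


principles = ["Batch size drives  queueing delay."]
print(is_literal_duplicate("batch size drives queueing delay.", principles))
```

Anything looser than this normalization (paraphrase, reordered clauses) falls into Tier 2 and needs a semantic check instead.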

openevolve under the same lens

The same tiering applies.

Tier 1. Literal duplicate programs (identical AST). A hash of the parsed AST catches them without running anything.
Tier 2. Programs that are syntactically different but behaviorally near-identical. Same probe outputs, same score band. The surrogate sees no shift in winner-belief, so they're wasted under the definition. A behavioral-embedding dedup check could catch many, with the same false-positive risk for genuine refinements that happen to share behavior.
Tier 3. Mutations that the framework's own sampling policy should arguably not have generated. For example, inspirations sampled from regions the surrogate already places below winner-threshold.
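The first two openevolve tiers can be sketched as two different keys: a structural one for literal duplicates and a behavioral one for near-identical programs. This assumes candidates are Python source and, for the behavioral key, that a candidate can be called on a fixed probe set and returns a number; `ast_hash`, `behavior_key`, the probe set, and the rounding precision are all hypothetical choices, not openevolve APIs.

```python
import ast
import hashlib


def ast_hash(source):
    """Tier 1 key: hash of the parsed AST, so comments and spacing are ignored."""
    tree = ast.parse(source)
    # ast.dump omits line/column attributes by default.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def behavior_key(program, probes, ndigits=3):
    """Tier 2 key: rounded outputs on a fixed probe set."""
    return tuple(round(program(x), ndigits) for x in probes)


# Tier 1: same AST despite different formatting and a comment.
print(ast_hash("x=1+2") == ast_hash("x = 1 + 2  # same thing"))

# Tier 2: syntactically different, behaviorally identical on the probes.
probes = [0, 1, 2.5]
print(behavior_key(lambda x: x * 2, probes) == behavior_key(lambda x: x + x, probes))
```

Matching behavior keys only show agreement on the chosen probes, which is where the false-positive risk for genuine refinements comes from.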

What we are deliberately not calling waste

Failed calls. Errors, crashes, validation-retry loops. Those tokens are spent recovering from process failures, so they count as failure cost and are reported separately. Worth flagging because validation-retry loops intuitively feel like waste, but under our definition they aren't, and a metric measuring informational waste shouldn't be conflated with one measuring failure recovery.

Substrate cost. If an LLM does work that deterministic code could have done (agent reasoning, handoff curation, prompt assembly), that's a substrate choice rather than waste. Those tokens are expensive but captured in normal cost reporting.

Cross-framework comparisons. The targets differ. "openevolve has X% waste, Nous has Y%" is not a fair comparison and the proposal won't make it. Per-framework profiles, reported separately.

Where I might be wrong

I want to be honest about a few soft spots in this framing.

First, "obvious waste" examples can fail the definition. I initially used validation-retry loops as the canonical obviously-wasted example, but under the definition they're failure cost, not informational waste. The metric asks "did this call shift belief about the target" and a retry loop is about recovering from a process failure. Different category.

Second, "confidence sharpening doesn't count as a belief shift" is a stipulation. A tighter posterior is literally a reduction in uncertainty. Calling it not a shift is a definitional choice. If confidence updates should count, Nous's waste rate drops substantially.

Third, the substrate-cost carve-out may systematically under-count waste. If an LLM does work that deterministic code could have done, and that work doesn't feed any structural change downstream, those tokens are exempt. The carve-out keeps the metric from punishing deliberate substrate choices, but it leans toward under-counting.

If you spot more soft spots, especially places I'm calling things waste that aren't, or missing waste that is, I want to hear them.

Where I'd love your input

The core question I want your read on:
Does Nous have wasted effort under this definition? If yes, when and where do you see it most clearly?

Concrete examples from campaigns you've run would be gold. Iteration numbers, arm types, whatever shape they took.

A few related questions while you're thinking about it:

Does treating confidence-only updates as no-shift match your intuition, or does it under-count something real?
Other frameworks. GEPA fits the same skeleton. AdaEvolve and EvoX may need their own variants. Should they go in v1 or a follow-up?
Are tokens the right unit? I'd welcome pushback. Executor sessions that run multi-hour external experiments may not be well-captured by tokens alone.
What am I missing? Categories of waste I haven't named, definitional issues I'm dodging, examples that don't fit cleanly anywhere.
