Wasted effort in agentic strategy evolution frameworks - proposed definition + first instantiations #100
vishakha-ramani started this conversation in General
The first-order question is:
Do frameworks like openevolve, GEPA, and Nous actually waste effort, and if so, how much and of what kind?
I want to first define what 'waste' even means rigorously, then show the definition is useful by measuring.
There's no single obvious definition. A few possible anchors:
I'm working with the last one:
Three things this forces us to specify, per framework:
There's also a fourth piece, shared across frameworks: a delayed-credit rule. A call that didn't shift belief immediately, but enabled a later call that did, gets a discounted share of the later shift. Default discount is 70%. Without this rule, any framework that stages information across calls gets falsely flagged as wasteful.
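A minimal sketch of how the delayed-credit rule could be computed. Everything here is an assumption for illustration: the `Call` record, the `enabled_by` link, and the reading of "70%" as a factor of 0.7 applied to the later shift (if the intended meaning is "reduced by 70%", pass `credit_factor=0.3`). It only handles a single enabling hop, and the later call keeps its full shift.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Call:
    call_id: str
    tokens: int
    immediate_shift: float            # belief shift observed right after this call
    enabled_by: Optional[str] = None  # hypothetical link to the call that enabled this one


def credited_shift(calls: List[Call], credit_factor: float = 0.7) -> Dict[str, float]:
    """Own immediate shift plus a discounted share of each directly enabled shift."""
    credit = {c.call_id: c.immediate_shift for c in calls}
    for c in calls:
        if c.enabled_by in credit and c.immediate_shift > 0:
            credit[c.enabled_by] += credit_factor * c.immediate_shift
    return credit


def wasted_calls(calls: List[Call], credit_factor: float = 0.7) -> List[str]:
    """Calls that end up with no credit at all, immediate or delayed."""
    credit = credited_shift(calls, credit_factor)
    return [c.call_id for c in calls if credit[c.call_id] <= 0]


calls = [
    Call("setup", tokens=1000, immediate_shift=0.0),
    Call("probe", tokens=500, immediate_shift=0.4, enabled_by="setup"),
    Call("idle", tokens=800, immediate_shift=0.0),
]
print(wasted_calls(calls))
```

In this toy run, `setup` is not flagged even though it shifted nothing on its own, which is the staging case the rule is meant to protect.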
The targets are different — and that matters
principles.json
We measure waste in LLM tokens, input plus output, with dollar cost as a direct derivative. Tokens are the dominant cost, are already logged by every framework's LLM client, and stay reproducible across machines.
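As a rough sketch of the unit of account, the snippet below sums input plus output tokens per call and derives dollar cost from per-million-token prices. The record fields (`input_tokens`, `output_tokens`, `wasted`) and the prices are placeholders, not any framework's actual log schema or any provider's real pricing.

```python
def dollar_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Dollar cost derived directly from token counts and per-million prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def waste_summary(calls, usd_per_m_input, usd_per_m_output):
    """Token and dollar totals, split into all calls vs. calls flagged as wasted."""
    total_tokens = sum(c["input_tokens"] + c["output_tokens"] for c in calls)
    wasted = [c for c in calls if c["wasted"]]
    wasted_tokens = sum(c["input_tokens"] + c["output_tokens"] for c in wasted)
    wasted_usd = sum(
        dollar_cost(c["input_tokens"], c["output_tokens"], usd_per_m_input, usd_per_m_output)
        for c in wasted
    )
    return {
        "total_tokens": total_tokens,
        "wasted_tokens": wasted_tokens,
        "wasted_fraction": wasted_tokens / total_tokens if total_tokens else 0.0,
        "wasted_usd": wasted_usd,
    }


# Placeholder log records and placeholder prices (USD per million tokens).
log = [
    {"input_tokens": 1000, "output_tokens": 500, "wasted": True},
    {"input_tokens": 3000, "output_tokens": 500, "wasted": False},
]
summary = waste_summary(log, usd_per_m_input=1.0, usd_per_m_output=2.0)
print(summary)
```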
Not all waste is equal. Three tiers.
A finding I want to surface up front. Not every category of waste under our definition is equally interesting to measure. We propose splitting reported waste into three tiers, because conflating them obscures both engineering opportunities and structural design choices.
Tier 1, hygiene waste. Literal duplicates that a deterministic check would catch. For example, DESIGN proposes an H-main whose text is essentially identical to an existing principle. A hash-based pre-check would catch it in roughly ten lines of code.
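The hash-based pre-check mentioned above could look roughly like this. It is a sketch, not Nous code: the function names are made up, and "essentially identical" is narrowed here to "identical after collapsing whitespace and lowercasing", which is an assumption about what the deterministic check would normalize.

```python
import hashlib
import re


def text_key(text: str) -> str:
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_literal_duplicate(h_main: str, existing_principles: list) -> bool:
    """True if the proposed H-main matches an existing principle after normalization."""
    existing = {text_key(p) for p in existing_principles}
    return text_key(h_main) in existing


principles = ["Larger batch sizes reduce tail latency under bursty load."]
print(is_literal_duplicate("larger batch sizes  reduce tail latency under bursty load.", principles))
print(is_literal_duplicate("Smaller batch sizes reduce tail latency.", principles))
```

Anything that survives this check but still means the same thing falls into Tier 2, where a deterministic hash no longer helps.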
Tier 2, semantic redundancy. Hypotheses that are semantically equivalent to existing principles but phrased differently, produce only confidence updates when run, and consume real tokens in the process. For example, iteration 5's H-main targets the same mechanism as iteration 2's principle but rephrased. EXECUTE_ANALYZE re-confirms, and principle_updates.json records only a confidence-sharpening update.
Tier 2 is reducible by an LLM-judge or semantic-similarity dedup check, but with a real false-positive risk. Some duplicate-looking hypotheses actually expose new regimes and produce structural changes such as prunes or direction revisions. The proposal can characterize the realistic ceiling on semantic dedup directly from data. Among semantic-duplicate H-mains, what fraction produce only confidence updates (catch them) vs. structural changes (don't)?
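One way the ceiling question could be answered from logged data is sketched below. The record fields (`semantic_duplicate`, `update_kind`, `tokens`) are hypothetical; they are not the actual schema of `principle_updates.json`, and how an H-main gets labelled a semantic duplicate is left outside the sketch.

```python
def dedup_ceiling(records):
    """Among semantic-duplicate H-mains, split confidence-only outcomes from structural ones."""
    dups = [r for r in records if r["semantic_duplicate"]]
    if not dups:
        return None
    confidence_only = [r for r in dups if r["update_kind"] == "confidence"]
    structural = [r for r in dups if r["update_kind"] != "confidence"]
    return {
        "n_duplicates": len(dups),
        # Safe to skip: a dedup check loses nothing by catching these.
        "catchable_fraction": len(confidence_only) / len(dups),
        # Would have been false positives: they produced prunes or revisions.
        "false_positive_fraction": len(structural) / len(dups),
        "catchable_tokens": sum(r["tokens"] for r in confidence_only),
    }


# Made-up records for illustration only.
records = [
    {"semantic_duplicate": True, "update_kind": "confidence", "tokens": 1200},
    {"semantic_duplicate": True, "update_kind": "confidence", "tokens": 900},
    {"semantic_duplicate": True, "update_kind": "prune", "tokens": 1500},
    {"semantic_duplicate": True, "update_kind": "confidence", "tokens": 700},
    {"semantic_duplicate": False, "update_kind": "revision", "tokens": 2000},
]
print(dedup_ceiling(records))
```

The `catchable_fraction` is the upper bound a perfect dedup check could reach; the `false_positive_fraction` is the share it would wrongly suppress.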
principle_updates.json records only a confidence-sharpening update. This arm is structurally wasted under the metric, but it isn't a Planner bug. The arm was required for falsifiability and rigor. The metric is surfacing a real methodology vs. information-efficiency tension that Nous accepts deliberately. Certain arms are non-optional even when unlikely to produce structural changes. We don't judge whether the trade-off is right. We just quantify what it costs. Likely the largest Tier 3 line item on the Nous side.
openevolve under the same lens
The same tiering applies.
Tier 1. Literal duplicate programs (identical AST). A behavioral hash catches them.
Tier 2. Programs that are syntactically different but behaviorally near-identical. Same probe outputs, same score band. The surrogate sees no shift in winner-belief, so they're wasted under the definition. A behavioral-embedding dedup check could catch many, with the same false-positive risk for genuine refinements that happen to share behavior.
Tier 3. Mutations that the framework's own sampling policy should arguably not have generated. For example, inspirations sampled from regions the surrogate already places below winner-threshold.
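The Tier 1 and Tier 2 checks above can be sketched as two fingerprints. This is not openevolve code. It assumes the evolved programs are Python source, and it uses a plain structural hash of the parsed AST for Tier 1 (a swapped-in alternative to the behavioral hash the text mentions, since identical ASTs can be detected without running anything). The Tier 2 fingerprint hashes probe outputs plus a score band; the probe set and `band_width` are placeholders.

```python
import ast
import hashlib


def structural_hash(source: str) -> str:
    """Tier 1: hash of the parsed AST, so comments and spacing do not matter."""
    tree = ast.parse(source)
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def behavioral_fingerprint(func, probes, score: float, band_width: float = 0.05) -> str:
    """Tier 2: hash of probe outputs plus the score band the program falls in."""
    outputs = tuple(repr(func(p)) for p in probes)
    band = int(score // band_width)
    return hashlib.sha256(repr((outputs, band)).encode("utf-8")).hexdigest()


# Same AST, different surface text.
a = "def f(x):\n    return x * 2  # double\n"
b = "def f(x):\n\n    return x*2\n"
print(structural_hash(a) == structural_hash(b))

# Different syntax, same behavior on the probes, same score band.
probes = [0, 1, 2, 3]
fp1 = behavioral_fingerprint(lambda x: x * 2, probes, score=0.81)
fp2 = behavioral_fingerprint(lambda x: x + x, probes, score=0.84)
print(fp1 == fp2)
```

Matching fingerprints only say the programs agree on the chosen probes, which is exactly where the false-positive risk for genuine refinements comes from.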
What we are deliberately not calling waste
Failed calls. Errors, crashes, validation-retry loops. Those tokens are spent recovering from process failures, so they count as failure cost and are reported separately. Worth flagging because validation-retry loops intuitively feel like waste, but under our definition they aren't, and a metric measuring informational waste shouldn't be conflated with one measuring failure recovery.
Substrate cost. If an LLM does work that deterministic code could have done (agent reasoning, handoff curation, prompt assembly), that's a substrate choice rather than waste. Those tokens are expensive but captured in normal cost reporting.
Cross-framework comparisons. The targets differ. "openevolve has X% waste, Nous has Y%" is not a fair comparison and the proposal won't make it. It reports per-framework profiles separately.
Where I might be wrong
I want to be honest about a few soft spots in this framing.
First, "obvious waste" examples can fail the definition. I initially used validation-retry loops as the canonical obviously-wasted example, but under the definition they're failure cost, not informational waste. The metric asks "did this call shift belief about the target" and a retry loop is about recovering from a process failure. Different category.
Second, "confidence sharpening doesn't count as a belief shift" is a stipulation. A tighter posterior is literally a reduction in uncertainty. Calling it not a shift is a definitional choice. If confidence updates should count, Nous's waste rate drops substantially.
Third, the substrate-cost carve-out may systematically under-count waste. If an LLM does work that deterministic code could have done, and that work doesn't feed any structural change downstream, those tokens are exempt. The carve-out keeps the metric from punishing deliberate substrate choices, but it leans toward under-counting.
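The second soft spot is easy to make concrete: the same log gives very different waste rates depending on whether confidence updates count as a belief shift. The sketch below reports both. The `outcome` labels and token counts are made up for illustration and do not come from any real campaign.

```python
def waste_rate(records, confidence_counts_as_shift: bool) -> float:
    """Token-weighted waste rate under one of the two stipulations."""
    total = sum(r["tokens"] for r in records)
    if total == 0:
        return 0.0
    useful = {"structural"}
    if confidence_counts_as_shift:
        useful.add("confidence")
    wasted = sum(r["tokens"] for r in records if r["outcome"] not in useful)
    return wasted / total


# Made-up records: outcome is "structural", "confidence", or "none".
runs = [
    {"tokens": 100, "outcome": "structural"},
    {"tokens": 300, "outcome": "confidence"},
    {"tokens": 100, "outcome": "none"},
]
strict = waste_rate(runs, confidence_counts_as_shift=False)
lenient = waste_rate(runs, confidence_counts_as_shift=True)
print(strict, lenient)
```

Reporting both numbers side by side would make the stipulation visible instead of baking it into a single headline figure.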
If you spot more soft spots, especially places I'm calling things waste that aren't, or missing waste that is, I want to hear them.
Where I'd love your input
The core question I want your read on:
Does Nous have wasted effort under this definition? If yes, when and where do you see it most clearly?
Concrete examples from campaigns you've run would be gold. Iteration numbers, arm types, whatever shape they took.
A few related questions while you're thinking about it:
or follow-up?