title: Adaptive Sandbox Fan-Out Controller
status: emerging
authors: Nikola Balic (@nibzard)
based_on: Labruno (GitHub), Swarm Migration Pattern
category: Reliability & Eval
source: https://github.com/nibzard/labruno-agent
tags: fan-out, adaptive, parallel-sandboxes, early-stopping, controller, variance, prompt-refinement

Problem

Parallel sandboxes are intoxicating: you can spawn 10... 100... 1000 runs. But four things break quickly:

  1. Diminishing returns: After some N, you're mostly paying for redundant failures or near-duplicate solutions
  2. Prompt fragility: If the prompt is underspecified, scaling N just scales errors (lots of sandboxes fail fast)
  3. Resource risk: Unbounded fan-out can overwhelm budgets, rate limits, or queues
  4. Oscillation risk: Poorly tuned thresholds can make the controller thrash between scale-up and scale-down decisions

Static "N=10 always" policies don't adapt to task difficulty, model variance, or observed failure rates. Most implementations use static caps rather than true signal-driven adaptation.

Solution

Add a controller that adapts fan-out in real time based on observed signals from early runs.

Core loop:

  1. Start small: Launch a small batch (e.g., N=3-5) in parallel

  2. Early signal sampling: As soon as the first X runs finish (or after T seconds), compute:

    • success rate (exit code / test pass)
    • diversity score (are solutions meaningfully different?)
    • judge confidence / winner margin
    • error clustering (same error everywhere vs varied errors)
  3. Decide next action:

    • Scale up if: success rate is good but quality variance is high (you want a better winner)
    • Stop early if: judge is confident + tests pass + solutions converge
    • Refine prompt / spec if: error clustering is high (everyone fails the same way)
    • Switch strategy if: repeated failure suggests decomposition is needed (spawn investigative sub-agent)
  4. Budget guardrails: Enforce max sandboxes, max runtime, and "no-progress" stop conditions

  5. Hysteresis for stability: Use different thresholds for scale-up vs. stop (e.g., scale up if confidence < 0.65, stop only if > 0.75) to prevent oscillation
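
The early-signal step (step 2) can be sketched as a small aggregation over finished runs. This is a minimal sketch under stated assumptions, not the Labruno implementation: `RunResult`, `Signals`, and `compute_signals` are hypothetical names, diversity is approximated as distinct solution hashes per passing run, and judge confidence is left out because it is assumed to come from a separate judge.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunResult:
    """Outcome of one sandbox run (hypothetical shape, not from the source)."""
    passed: bool                    # exit code was 0 and the tests passed
    solution_hash: Optional[str]    # hash of the normalized solution, if any
    error_signature: Optional[str]  # e.g. exception type plus top stack frame


@dataclass(frozen=True)
class Signals:
    success_rate: float     # fraction of runs that passed
    diversity: float        # distinct solutions / passing runs (0.0 if none passed)
    top_error_share: float  # share of all runs hit by the most common error


def compute_signals(runs: list[RunResult]) -> Signals:
    """Aggregate the early signals from the runs that have finished so far."""
    if not runs:
        raise ValueError("need at least one finished run")
    passed = [r for r in runs if r.passed]
    # Count error signatures of failed runs only.
    errors = Counter(
        r.error_signature for r in runs if not r.passed and r.error_signature
    )
    distinct = len({r.solution_hash for r in passed})
    top_error_count = errors.most_common(1)[0][1] if errors else 0
    return Signals(
        success_rate=len(passed) / len(runs),
        diversity=distinct / len(passed) if passed else 0.0,
        top_error_share=top_error_count / len(runs),
    )
```

A high `top_error_share` with a low `success_rate` corresponds to the "everyone fails the same way" case, while a high `diversity` among passing runs suggests that more samples may still produce a better winner.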

flowchart TD
    A[Task] --> B[Launch small batch N=3-5]
    B --> C[Collect early results]
    C --> D{Signals}
    D -->|Good success + high variance| E[Increase N]
    D -->|Good success + high confidence| F[Stop early + return winner]
    D -->|Clustered failures| G[Prompt/spec refinement step]
    D -->|Ambiguous / large task| H[Decompose + sub-agent investigation]
    E --> C
    G --> B
    H --> B
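
The launch, collect, and scale-up edges of the flowchart, together with the budget guardrails, can be sketched as a loop. This is an illustrative sketch only: `fan_out`, `Budget`, `run_batch`, and `is_confident` are hypothetical names, each run is assumed to yield a single numeric score, and the refinement and decomposition branches are omitted for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    max_sandboxes: int = 50       # hard cap on total runs
    max_runtime_s: float = 600.0  # wall-clock cap for the whole task
    max_stale_rounds: int = 2     # rounds allowed without a better best score


def fan_out(
    run_batch: Callable[[int], list[float]],      # assumed: runs n sandboxes, returns one score per run
    is_confident: Callable[[list[float]], bool],  # assumed: judge over all scores so far
    budget: Budget,
    start_n: int = 3,
    step: int = 3,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[str, list[float]]:
    """Return (stop_reason, all_scores)."""
    started = clock()
    scores: list[float] = []
    best = float("-inf")
    stale = 0
    n = start_n
    while True:
        # Guardrails are checked before every launch.
        n = min(n, budget.max_sandboxes - len(scores))
        if n <= 0:
            return "max_sandboxes", scores
        if clock() - started > budget.max_runtime_s:
            return "max_runtime", scores
        scores.extend(run_batch(n))
        if is_confident(scores):
            return "confident", scores  # stop early and return the winner
        new_best = max(scores, default=float("-inf"))
        if new_best > best:
            best, stale = new_best, 0
        else:
            stale += 1  # this round did not improve the best score
        if stale >= budget.max_stale_rounds:
            return "no_progress", scores
        n = step  # scale up by a fixed increment
```

Passing `clock` as a parameter keeps the runtime cap testable without real waiting. The runtime check here only runs between batches, so a production controller would also need per-sandbox timeouts.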

How to use it

Use when:

  • You're doing "best-of-N codegen + execution" in sandboxes
  • You have cheap objective checks (unit tests, static analysis, schema validation)
  • Latency and cost matter: you want the minimum N that achieves reliability

Concrete heuristics (example):

  • Start N=3
  • If >=2 succeed but disagree and judge confidence < 0.65 -> launch 3 more
  • If 0 succeed and the top error signature covers >70% of runs -> run a "spec clarifier" step, then restart
  • Hysteresis: Stop only if confidence > 0.75 (higher threshold than scale-up) to prevent thrash
  • Hard cap: N_max (e.g., 50), runtime cap, and "two refinement attempts then decompose"
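
The heuristics above can be written as a single decision function. This is a sketch under stated assumptions rather than a reference implementation: `Action` and `decide` are hypothetical names, stopping additionally requires at least one passing run, and any case the heuristics do not cover (including the 0.65 to 0.75 dead band) returns `HOLD`, meaning "keep the current N and wait for more results".

```python
from enum import Enum


class Action(Enum):
    STOP = "stop"
    SCALE_UP = "scale_up"            # launch BATCH_INCREMENT more sandboxes
    REFINE_PROMPT = "refine_prompt"  # run the "spec clarifier" step, then restart
    DECOMPOSE = "decompose"
    HOLD = "hold"                    # assumption: wait for more results


# Thresholds taken from the example heuristics in the text.
N_START = 3
BATCH_INCREMENT = 3
SCALE_UP_BELOW = 0.65        # scale up only below this judge confidence
STOP_ABOVE = 0.75            # stop only above this judge confidence
ERROR_CLUSTER_SHARE = 0.70   # top error signature must cover more than this
N_MAX = 50
MAX_REFINEMENTS = 2          # "two refinement attempts then decompose"


def decide(
    n_launched: int,
    n_succeeded: int,
    solutions_agree: bool,
    judge_confidence: float,
    top_error_share: float,
    refinements_used: int,
) -> Action:
    # Everyone fails the same way: fix the prompt, or decompose after two tries.
    if n_succeeded == 0 and top_error_share > ERROR_CLUSTER_SHARE:
        if refinements_used >= MAX_REFINEMENTS:
            return Action.DECOMPOSE
        return Action.REFINE_PROMPT
    # Upper hysteresis threshold: a confident winner exists.
    if n_succeeded >= 1 and judge_confidence > STOP_ABOVE:
        return Action.STOP
    # Hard cap: never launch beyond N_MAX.
    if n_launched >= N_MAX:
        return Action.STOP
    # Lower hysteresis threshold: successes disagree and the judge is unsure.
    if n_succeeded >= 2 and not solutions_agree and judge_confidence < SCALE_UP_BELOW:
        return Action.SCALE_UP
    return Action.HOLD
```

Because the scale-up and stop thresholds differ, a confidence that hovers around 0.7 triggers neither action, which is what keeps the controller from thrashing.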

Trade-offs

Pros:

  • Prevents "scale errors" when prompts are bad
  • Lowers spend by stopping early when a clear winner appears
  • Makes sandbox swarms production-safe via budgets and no-progress stopping

Cons:

  • Requires instrumentation (collecting failure signatures, confidence, diversity)
  • Needs careful defaults and hysteresis to avoid oscillation (scale up/down thrash)
  • Bad scoring functions can cause premature stopping
  • Few verified implementations; most systems use static caps instead of true signal-driven adaptation

References