
Fix #283: token budget + graceful degradation for /deep-research #316

Open
ericleepi314 wants to merge 1 commit into fix/issue-282-structured-output-coercion from feature/issue-283-research-budget


@ericleepi314

Collaborator

Closes #283

Stacked on #315 (deep stack down to #304). Merge in order; GitHub retargets automatically.

Summary

/deep-research had no token ceiling — a verbose model (deepseek) burned ~888k tokens in Search+Verify before the user saw anything.

Engine

  • meta.default_budget — a workflow may declare its own ceiling, applied only when the caller set no budget (explicit budget_total and inherited parent Budgets win; nested workflow() children share the parent's budget and are unaffected).
  • Per-workflow env override CLAWCODEX_<NAME>_TOKEN_BUDGET — deep-research reads exactly the env var the issue names, CLAWCODEX_DEEP_RESEARCH_TOKEN_BUDGET (0 disables; malformed values ignored).
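The resolution order described above can be sketched roughly as follows. This is an illustration only, not the engine's code: `resolve_budget`, `env_var_name`, and their parameters are hypothetical names. It also assumes that explicit and inherited budgets outrank the env override, that `0` means "no ceiling", and that negative values count as malformed; the description does not spell out those three details.

```python
import os
import re
from typing import Mapping, Optional


def env_var_name(workflow_name: str) -> str:
    """Map a workflow name such as 'deep-research' to its env override name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", workflow_name).strip("_").upper()
    return f"CLAWCODEX_{slug}_TOKEN_BUDGET"


def _valid_default(value: object) -> Optional[int]:
    """Accept only positive ints; bool (an int subclass) and str are rejected."""
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        return None
    return value


def resolve_budget(
    workflow_name: str,
    meta: Mapping[str, object],
    explicit_budget: Optional[int] = None,
    parent_budget: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[int]:
    """Return the token ceiling for a run, or None for 'no ceiling'."""
    # A budget set by the caller, or inherited from a parent, always wins.
    if explicit_budget is not None:
        return explicit_budget
    if parent_budget is not None:
        return parent_budget

    env = os.environ if env is None else env
    raw = env.get(env_var_name(workflow_name))
    if raw is not None:
        try:
            parsed = int(raw.strip())
        except ValueError:
            parsed = -1  # malformed: ignore and fall through to the meta default
        if parsed >= 0:
            return parsed or None  # assumption: 0 disables the ceiling

    return _valid_default(meta.get("default_budget"))
```

With `meta = {"default_budget": 400000}` and no env override, the sketch returns the 400k default; setting `CLAWCODEX_DEEP_RESEARCH_TOKEN_BUDGET=0` returns `None`, and a value like `abc` is skipped in favour of the meta default.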

deep-research script (default_budget: 400000)

  • Verify gating: launches only as many verifiers as the remaining budget affords (per-verifier cost estimated from the observed Search spend; a 40k Synthesize reserve held back). Unaffordable claims pass through unverified — logged, never silent (an unrun check contradicts nothing under the "supported unless contradicted" verdict contract).
  • None verdicts keep their claims (crashed verifier or ceiling trip) instead of silently dropping them — a pre-existing bug fixed along the way.
  • Overshoot fallback: spend within an already-launched wave is uncapped, so a ceiling trip at the final Synthesize call previously failed the whole run with no report after full spend (a failure the critic reproduced). The script now re-checks the budget before Synthesize and falls back to returning the raw surviving claims: expensive something instead of expensive nothing.
  • Per-stage spend surfaced via log() lines (the progress-UI narrator).
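As a rough sketch of the degradation logic in the bullets above (not the bundled script itself), the three decisions could look like this. All function names and the verdict encoding are hypothetical. The per-verifier estimate is assumed here to be the average spend per Search agent; the description only says it is derived from the observed Search spend.

```python
from typing import Callable, Dict, List, Optional

SYNTHESIZE_RESERVE = 40_000  # tokens held back for Synthesize, per the description


def affordable_verifiers(
    budget_total: int,
    spent: int,
    search_spend: int,
    n_searchers: int,
    n_claims: int,
    reserve: int = SYNTHESIZE_RESERVE,
) -> int:
    """How many verifiers the remaining budget affords (0 if none)."""
    # Assumption: one verifier costs about what one Search agent cost.
    per_verifier = max(1, search_spend // max(1, n_searchers))
    available = budget_total - spent - reserve
    if available <= 0:
        return 0
    return min(n_claims, available // per_verifier)


def surviving_claims(claims: List[str], verdicts: List[Optional[bool]]) -> List[str]:
    """Drop only contradicted claims.

    Hypothetical encoding: True = contradicted, False = not contradicted,
    None = no verdict (crashed verifier, ceiling trip, or never launched).
    A None verdict keeps its claim: an unrun check contradicts nothing.
    """
    return [c for c, v in zip(claims, verdicts) if v is not True]


def finish(
    claims: List[str],
    budget_total: int,
    spent: int,
    synthesize: Callable[[List[str]], str],
) -> Dict[str, object]:
    """Re-check the budget before Synthesize; fall back to raw claims if overshot."""
    if spent >= budget_total:
        return {"report": None, "claims": list(claims), "degraded": True}
    return {"report": synthesize(claims), "claims": list(claims), "degraded": False}
```

For instance, with a 400k budget, 200k already spent by four Search agents, and four claims, the sketch estimates 50k per verifier and 160k available after the reserve, so it launches three verifiers and passes the fourth claim through unverified.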

Test plan

  • 14 tests: default-budget resolution units (env precedence/zero/malformed, bool/string rejection, bundled meta declares 400k), meta-default-reaches-script + explicit-budget-wins, and degradation through the real engine + real bundled script with deterministic runners: tight budget → 0 verifiers but synthesize runs; partial → exactly 2 of 4 verified; no budget → meta default, all verified; crashed verifier → claim retained; incident-profile overshoot (4×120k vs 400k) → raw-claims fallback, error is None, exactly 4 agent calls
  • Workflow suite: 186 passed
  • Full suite on the stack: 7879 passed, 0 failed, 5 skipped
  • Critic review loop: APPROVE after 1 revision round (the Synthesize-trip failure was found and reproduced there, on both verbose-search and cheap-search/verbose-verify profiles)

Follow-ups noted in review (non-gating): chunked verify waves with budget recomputation, exposing workflow error classes in the sandbox builtins, a user-facing budget parameter on the Workflow tool.

🤖 Generated with Claude Code

A verbose model burned ~888k tokens in Search+Verify with no ceiling
and no warning.

- engine: a workflow's meta may declare default_budget, applied when
  the caller set no budget (explicit budget_total and inherited parent
  Budgets win; nested workflow() children unaffected). Per-workflow env
  override CLAWCODEX_<NAME>_TOKEN_BUDGET — deep-research reads exactly
  CLAWCODEX_DEEP_RESEARCH_TOKEN_BUDGET (0 disables)
- deep-research declares default_budget=400000 and degrades instead of
  dying: the Verify fan-out only launches as many verifiers as the
  remaining budget affords (estimated from the observed Search spend,
  Synthesize reserve held back); unaffordable claims pass through
  UNVERIFIED with a log line; a None verdict (crashed verifier or
  ceiling trip) keeps its claim instead of silently dropping it; and
  if the already-launched waves overshot the whole budget, Synthesize
  falls back to returning the raw surviving claims rather than failing
  after full spend
- per-stage spend surfaced via log lines (the progress UI narrator)

Closes #283

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>