
fix: try local eval before slow /evaluate endpoint in evaluate_dense #245

Merged
abrichr merged 1 commit into main from fix/skip-slow-binary-eval-for-milestones
Mar 29, 2026

@abrichr commented Mar 29, 2026

Summary

51% of TRL training time was wasted on evaluate timeouts against port 5050 (180s × 3 retries = 9 min per evaluation). Local evaluation via evaluate_checks_local takes ~5s.

Before: evaluate() [9 min timeout] → if score=0.0 → evaluate_checks_local() [5s]
After: evaluate_checks_local() [5s] → if no checks defined → evaluate() [9 min]

When a task config defines checks (as all custom YAML tasks do), local eval runs first and the slow /evaluate endpoint is never called. Binary eval is used only as a fallback when no local checks exist (WAA built-in tasks).
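
The reordering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual evaluate_dense code: the function name `evaluate_local_first`, the `"checks"` key lookup on a dict config, and the injected callables are hypothetical. Only evaluate_checks_local and the /evaluate endpoint are named in this PR.

```python
from typing import Callable


def evaluate_local_first(
    task_config: dict,
    evaluate_checks_local: Callable[[dict], float],
    evaluate_binary: Callable[[], float],
) -> float:
    """Return a score, preferring fast local checks over the slow endpoint.

    Hypothetical sketch: the real signature and config layout may differ.
    """
    checks = task_config.get("checks") or []
    if checks:
        # Local checks exist: use their result, even if it is 0.0,
        # and never call the slow /evaluate endpoint.
        return evaluate_checks_local(task_config)
    # No local checks (e.g. built-in tasks): fall back to binary eval.
    return evaluate_binary()
```

Note the consequence that the test plan calls out: once checks are defined, a local score of 0.0 is final and the binary evaluator is not consulted at all.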

Test plan

  • test_local_eval_before_binary_when_checks_defined — local eval runs, binary skipped
  • test_binary_eval_used_when_no_checks — falls through when no checks
  • test_local_eval_failure_does_not_call_binary — binary still skipped even if local returns 0.0
  • 1471 passed, 54 skipped in full suite

🤖 Generated with Claude Code

51% of TRL training time was wasted on port 5050 evaluate timeouts
(180s × 3 retries = 9 min per evaluation). Local evaluation via
evaluate_checks_local takes ~5s.

Fix: when task config has checks defined, try local eval FIRST. Only
fall through to the slow /evaluate endpoint when no local checks exist.
This eliminates the 9-minute timeout for custom YAML tasks that define
their own checks.

Before: evaluate() [9 min] → if 0.0 → local [5s]
After:  local [5s] → if no checks → evaluate() [9 min]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abrichr merged commit 3b8c1c2 into main on Mar 29, 2026
1 check passed
abrichr added a commit that referenced this pull request Mar 29, 2026
… eval (#246)

Reverts the evaluate_dense reordering from #245 (local-first was too
aggressive — skipped binary eval entirely, losing the signal when 5050
IS available).

The actual fix: set evaluate_timeout=15s and evaluate_retries=1 on the
WAALiveAdapter in the TRL wrapper. The evaluate_dense logic stays
correct (try binary first, local fallback, take max). Training speed
comes from fast failure, not from skipping evaluation paths.
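
For comparison, the restored ordering (binary first, local fallback, take max) might look like the sketch below. The function name, the `"checks"` key, and the choice to treat a `TimeoutError` as a score of 0.0 are assumptions for illustration. The commit message does not show how the real code handles a failed /evaluate call.

```python
from typing import Callable


def evaluate_binary_first(
    task_config: dict,
    evaluate_binary: Callable[[], float],
    evaluate_checks_local: Callable[[dict], float],
) -> float:
    """Try binary eval first, fall back to local checks, keep the max.

    Hypothetical sketch: names and error handling are assumptions.
    """
    try:
        binary_score = evaluate_binary()
    except TimeoutError:
        # Assumed behavior: an unreachable endpoint counts as 0.0.
        binary_score = 0.0
    if binary_score > 0.0 or not task_config.get("checks"):
        # Binary eval gave a signal, or there is nothing local to try.
        return binary_score
    # Binary eval produced no signal: consult the local checks as well.
    return max(binary_score, evaluate_checks_local(task_config))
```

With this ordering the cost of an unavailable endpoint is bounded by the adapter's timeout and retry settings, so the fallback path is cheap as long as those settings are small.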

- Benchmarking: 180s timeout, 3 retries (thorough, one-shot)
- Training: 15s timeout, 1 retry (fast feedback, thousands of evals)
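
The two profiles above can be compared by their worst-case wait per evaluation. The sketch below is illustrative only: `EvalProfile` is a hypothetical container, not a class from the codebase, and it treats the retry count as the total number of attempts, which matches the 180s × 3 = 9 min arithmetic in #245.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalProfile:
    """Hypothetical holder for the two settings named in the commit."""

    evaluate_timeout: float  # seconds allowed per attempt
    evaluate_retries: int    # attempts (assumed to be the total count)

    def worst_case_seconds(self) -> float:
        # Every attempt runs to its full timeout before giving up.
        return self.evaluate_timeout * self.evaluate_retries


BENCHMARKING = EvalProfile(evaluate_timeout=180.0, evaluate_retries=3)
TRAINING = EvalProfile(evaluate_timeout=15.0, evaluate_retries=1)

print(BENCHMARKING.worst_case_seconds() / 60)  # 9.0 (minutes)
print(TRAINING.worst_case_seconds())           # 15.0 (seconds)
```

Under these assumptions the training profile cuts the worst-case wait from 540s to 15s per evaluation, a 36x reduction, without removing the binary evaluation path.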

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>