fix: try local eval before slow /evaluate endpoint in evaluate_dense by abrichr · Pull Request #245 · OpenAdaptAI/openadapt-evals

abrichr · 2026-03-29T20:13:05Z

Summary

51% of TRL training time wasted on port 5050 evaluate timeouts (180s × 3 retries = 9 min per evaluation). Local evaluation via evaluate_checks_local takes ~5s.

Before: evaluate() [9 min timeout] → if score=0.0 → evaluate_checks_local() [5s]
After: evaluate_checks_local() [5s] → if no checks defined → evaluate() [9 min]

When task config has checks defined (all custom YAML tasks), local eval runs first and the slow /evaluate endpoint is never called. Binary eval is only used as fallback when no local checks exist (WAA built-in tasks).
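The "After" ordering can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `evaluate_dense_local_first`, the dict-shaped `task_config` with a `checks` key, and the two callables passed in are all assumptions made for the example.

```python
from typing import Callable


def evaluate_dense_local_first(
    task_config: dict,
    evaluate_checks_local: Callable[[dict], float],
    evaluate_binary: Callable[[dict], float],
) -> float:
    """Hypothetical sketch of the local-first ordering described above."""
    # Assumption: a task "has checks defined" when its config carries a
    # non-empty "checks" entry (as the custom YAML tasks are said to do).
    if task_config.get("checks"):
        # Local checks exist: use the fast path and never touch /evaluate,
        # even when the local score comes back as 0.0.
        return evaluate_checks_local(task_config)
    # No local checks (e.g. WAA built-in tasks): fall back to the slow
    # binary evaluation behind the /evaluate endpoint.
    return evaluate_binary(task_config)
```

With this shape, the worst case for a task with checks is the local evaluation time, since the binary path is unreachable for it.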

Test plan

test_local_eval_before_binary_when_checks_defined — local eval runs, binary skipped
test_binary_eval_used_when_no_checks — falls through when no checks
test_local_eval_failure_does_not_call_binary — binary still skipped even if local returns 0.0
1471 passed, 54 skipped in full suite
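
The three test names above suggest checks along these lines. This is only a sketch using `unittest.mock`: the `evaluate_dense` stand-in and its signature are hypothetical and exist just to make the example self-contained, so the real tests in the suite will look different.

```python
from unittest.mock import Mock


def evaluate_dense(task_config, local_eval, binary_eval):
    # Hypothetical stand-in for the local-first ordering; not the real function.
    if task_config.get("checks"):
        return local_eval(task_config)
    return binary_eval(task_config)


def test_local_eval_before_binary_when_checks_defined():
    local, binary = Mock(return_value=1.0), Mock(return_value=1.0)
    assert evaluate_dense({"checks": ["c1"]}, local, binary) == 1.0
    local.assert_called_once()
    binary.assert_not_called()  # slow endpoint skipped


def test_binary_eval_used_when_no_checks():
    local, binary = Mock(return_value=1.0), Mock(return_value=0.5)
    assert evaluate_dense({}, local, binary) == 0.5
    local.assert_not_called()
    binary.assert_called_once()  # falls through to binary eval


def test_local_eval_failure_does_not_call_binary():
    local, binary = Mock(return_value=0.0), Mock(return_value=1.0)
    assert evaluate_dense({"checks": ["c1"]}, local, binary) == 0.0
    binary.assert_not_called()  # still skipped when local returns 0.0
```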

🤖 Generated with Claude Code

51% of TRL training time wasted on 5050 evaluate timeouts (180s × 3 retries = 9 min per evaluation). The local evaluation via evaluate_checks_local takes ~5s.

Fix: when task config has checks defined, try local eval FIRST. Only fall through to the slow /evaluate endpoint when no local checks exist. This eliminates the 9-minute timeout for custom YAML tasks that define their own checks.

Before: evaluate() [9 min] → if 0.0 → local [5s]
After: local [5s] → if no checks → evaluate() [9 min]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… eval

Reverts the evaluate_dense reordering from #245 (local-first was too aggressive: it skipped binary eval entirely, losing the signal when 5050 IS available).

The actual fix: set evaluate_timeout=15s and evaluate_retries=1 on the WAALiveAdapter in the TRL wrapper. The evaluate_dense logic stays correct (try binary first, local fallback, take max). Training speed comes from fast failure, not from skipping evaluation paths.

- Benchmarking: 180s timeout, 3 retries (thorough, one-shot)
- Training: 15s timeout, 1 retry (fast feedback, thousands of evals)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
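
The two timeout profiles and the "binary first, local fallback, take max" logic can be sketched as below. The `EvalProfile` dataclass, the function name, and the choice to treat a `TimeoutError` from the binary path as a score of 0.0 are assumptions for illustration; only the numeric settings come from the commit message, and the real WAALiveAdapter configuration may be expressed differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalProfile:
    """Hypothetical container for the two settings named in the commit."""
    evaluate_timeout: float  # seconds per attempt
    evaluate_retries: int    # number of attempts

    @property
    def worst_case_seconds(self) -> float:
        # Every attempt runs until it times out.
        return self.evaluate_timeout * self.evaluate_retries


BENCHMARKING = EvalProfile(evaluate_timeout=180.0, evaluate_retries=3)  # 9 min
TRAINING = EvalProfile(evaluate_timeout=15.0, evaluate_retries=1)       # 15 s


def evaluate_dense_binary_first(
    binary_eval: Callable[[], float],
    local_eval: Callable[[], float],
) -> float:
    """Try binary eval first, fall back to local checks, keep the max."""
    try:
        binary_score = binary_eval()
    except TimeoutError:
        # Assumption: an unreachable /evaluate endpoint surfaces as a timeout
        # and counts as 0.0 rather than aborting the rollout.
        binary_score = 0.0
    if binary_score > 0.0:
        return binary_score
    # Binary gave no signal: run the local checks and take the better score.
    return max(binary_score, local_eval())
```

Under these assumptions, a dead endpoint costs at most `TRAINING.worst_case_seconds` (15 s) per evaluation during training instead of 540 s, while the binary signal is still used whenever port 5050 responds.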

abrichr merged commit 3b8c1c2 into main Mar 29, 2026
1 check passed

abrichr mentioned this pull request Mar 29, 2026

fix: use training-appropriate evaluate timeouts instead of reordering eval #246

Merged

4 tasks

abrichr merged 1 commit into main from fix/skip-slow-binary-eval-for-milestones
