fix: try local eval before slow /evaluate endpoint in evaluate_dense #245
Merged
51% of TRL training time was wasted on port 5050 evaluate timeouts (180s × 3 retries = 9 min per evaluation). The local evaluation via evaluate_checks_local takes ~5s.

Fix: when the task config has checks defined, try local eval FIRST. Only fall through to the slow /evaluate endpoint when no local checks exist. This eliminates the 9-minute timeout for custom YAML tasks that define their own checks.

Before: evaluate() [9 min] → if 0.0 → local [5s]
After: local [5s] → if no checks → evaluate() [9 min]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
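The "After" ordering can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `evaluate_dense_local_first`, its parameters, and the use of plain callables for the two evaluation paths are all assumptions made for the example.

```python
from typing import Callable, Mapping


def evaluate_dense_local_first(
    task_config: Mapping[str, object],
    evaluate_local: Callable[[], float],
    evaluate_remote: Callable[[], float],
) -> float:
    """Hypothetical sketch of the local-first ordering.

    evaluate_local stands in for evaluate_checks_local (~5s);
    evaluate_remote stands in for the slow /evaluate call (up to 9 min).
    """
    if task_config.get("checks"):
        # Checks are defined: the local score is final, even if it is 0.0,
        # and the remote endpoint is never contacted.
        return evaluate_local()
    # No local checks exist, so the remote endpoint is the only option.
    return evaluate_remote()
```

Note that under this ordering a task with checks never reaches the remote evaluator, which is the trade-off the PR accepts in exchange for avoiding the timeout.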
abrichr added a commit that referenced this pull request on Mar 29, 2026

… eval

Reverts the evaluate_dense reordering from #245 (local-first was too aggressive: it skipped binary eval entirely, losing the signal when 5050 IS available).

The actual fix: set evaluate_timeout=15s and evaluate_retries=1 on the WAALiveAdapter in the TRL wrapper. The evaluate_dense logic stays correct (try binary first, local fallback, take max). Training speed comes from fast failure, not from skipping evaluation paths.

- Benchmarking: 180s timeout, 3 retries (thorough, one-shot)
- Training: 15s timeout, 1 retry (fast feedback, thousands of evals)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abrichr added a commit that referenced this pull request on Mar 29, 2026

… eval (#246)

Reverts the evaluate_dense reordering from #245 (local-first was too aggressive: it skipped binary eval entirely, losing the signal when 5050 IS available).

The actual fix: set evaluate_timeout=15s and evaluate_retries=1 on the WAALiveAdapter in the TRL wrapper. The evaluate_dense logic stays correct (try binary first, local fallback, take max). Training speed comes from fast failure, not from skipping evaluation paths.

- Benchmarking: 180s timeout, 3 retries (thorough, one-shot)
- Training: 15s timeout, 1 retry (fast feedback, thousands of evals)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
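A rough sketch of the approach described in the follow-up commit: keep the binary-first order, but bound how long a failing binary eval can block. Everything here is illustrative. `EvalProfile`, `evaluate_dense_binary_first`, and the `has_checks` flag are hypothetical names; only the `evaluate_timeout` and `evaluate_retries` values come from the commit message. The sketch also assumes that `evaluate_retries` counts total attempts (consistent with "180s × 3 retries = 9 min") and that a timed-out binary eval surfaces as `TimeoutError`, which the real adapter may handle differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalProfile:
    evaluate_timeout: float  # seconds allowed per attempt
    evaluate_retries: int    # assumed to mean total attempts

    @property
    def worst_case_seconds(self) -> float:
        # Upper bound on time spent waiting when every attempt times out.
        return self.evaluate_timeout * self.evaluate_retries


# Values taken from the commit message; the class itself is hypothetical.
BENCHMARKING = EvalProfile(evaluate_timeout=180.0, evaluate_retries=3)
TRAINING = EvalProfile(evaluate_timeout=15.0, evaluate_retries=1)


def evaluate_dense_binary_first(
    evaluate_binary: Callable[[], float],
    evaluate_local: Callable[[], float],
    has_checks: bool,
) -> float:
    """Try binary eval first, fall back to local checks, take the max."""
    try:
        binary = evaluate_binary()
    except TimeoutError:
        # Assumption: a timeout is treated as "no signal", i.e. 0.0.
        binary = 0.0
    if binary > 0.0 or not has_checks:
        return binary
    # Binary gave no signal and local checks exist: use the better score.
    return max(binary, evaluate_local())
```

With these numbers the worst case per evaluation drops from 540 seconds (benchmarking profile) to 15 seconds (training profile), while the binary signal is still used whenever port 5050 responds.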
Summary
51% of TRL training time wasted on port 5050 evaluate timeouts (180s × 3 retries = 9 min per evaluation). Local evaluation via evaluate_checks_local takes ~5s.

Before: evaluate() [9 min timeout] → if score=0.0 → evaluate_checks_local() [5s]
After: evaluate_checks_local() [5s] → if no checks defined → evaluate() [9 min]

When task config has checks defined (all custom YAML tasks), local eval runs first and the slow /evaluate endpoint is never called. Binary eval is only used as fallback when no local checks exist (WAA built-in tasks).

Test plan
- test_local_eval_before_binary_when_checks_defined: local eval runs, binary skipped
- test_binary_eval_used_when_no_checks: falls through when no checks
- test_local_eval_failure_does_not_call_binary: binary still skipped even if local returns 0.0

🤖 Generated with Claude Code
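The three test names in the plan suggest tests along these lines. This is only a guess at their shape, written against a small stand-in `evaluate_dense` that takes the two evaluation paths as callables; the real function's signature and the real test bodies are not shown in this PR page, and the `"placeholder"` check entry is invented for illustration.

```python
from unittest.mock import Mock


def evaluate_dense(task_config, evaluate_local, evaluate_binary):
    """Stand-in for the local-first ordering described in the summary."""
    if task_config.get("checks"):
        return evaluate_local()
    return evaluate_binary()


def test_local_eval_before_binary_when_checks_defined():
    local, binary = Mock(return_value=1.0), Mock(return_value=1.0)
    assert evaluate_dense({"checks": ["placeholder"]}, local, binary) == 1.0
    local.assert_called_once()
    binary.assert_not_called()


def test_binary_eval_used_when_no_checks():
    local, binary = Mock(return_value=1.0), Mock(return_value=0.5)
    assert evaluate_dense({}, local, binary) == 0.5
    local.assert_not_called()
    binary.assert_called_once()


def test_local_eval_failure_does_not_call_binary():
    # A local score of 0.0 is final; the binary path is still skipped.
    local, binary = Mock(return_value=0.0), Mock(return_value=1.0)
    assert evaluate_dense({"checks": ["placeholder"]}, local, binary) == 0.0
    binary.assert_not_called()
```

The third test pins down the behavior that the follow-up commit later reverted: a failing local eval does not trigger the binary path.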