feat: add repeat and repeat-fail-on-threshold inputs (#865)
Add two new inputs for handling non-deterministic LLM evals:

- `repeat`: runs each test N times via promptfoo's `--repeat` flag
- `repeat-fail-on-threshold`: per-test threshold requiring each individual test to pass a minimum percentage of its repeated runs

Example: `repeat=3` with `repeat-fail-on-threshold=66` means each test must pass at least 2 out of 3 runs. This filters out systematic failures while tolerating random grader variance.

Key design decisions:

- Per-test best-of-N, not global aggregate: results are grouped by test description (or vars as a fallback) and each test is checked independently against the threshold
- Both `fail-on-threshold` and `repeat-fail-on-threshold` run independently when both are set
- When thresholds are configured and pass, the action succeeds even if promptfoo exits non-zero (which it does whenever any test fails)
- Info logging shows the repeat config, threshold results, and clear explanations when exec errors are suppressed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
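The per-test best-of-N check described above can be sketched as follows. This is a minimal illustration, not the action's actual code: the `TestResult` shape, field names, and function name are assumptions, and promptfoo's real output schema may differ.

```typescript
// Hypothetical result shape; promptfoo's actual output schema may differ.
interface TestResult {
  description: string; // grouping key (vars would serve as the fallback)
  success: boolean;
}

// Group repeated runs by test, then require each test to pass at least
// `thresholdPct` percent of its own runs (per-test best-of-N, not a
// global aggregate across all runs).
function checkRepeatThreshold(
  results: TestResult[],
  thresholdPct: number,
): boolean {
  const groups = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const g = groups.get(r.description) ?? { passed: 0, total: 0 };
    g.total += 1;
    if (r.success) g.passed += 1;
    groups.set(r.description, g);
  }
  // Every test must independently meet the threshold.
  for (const g of groups.values()) {
    if ((g.passed / g.total) * 100 < thresholdPct) return false;
  }
  return true;
}
```

With `repeat=3` and a threshold of 66, a test passing 2 of its 3 runs scores 66.7% and meets the threshold, while a test passing only 1 of 3 (33.3%) fails it regardless of how other tests did.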
Redesign the repeat/threshold feature from PR #865 for better DX:

- Rename `repeat-fail-on-threshold` to `repeat-min-pass` (absolute count instead of percentage: "2 of 3" rather than "66%")
- Change the `repeat` default from '1' to '' (omitted = absent)
- Add strict input parsing that rejects "2.5", "3abc", "02"
- Add cross-field validation (repeat >= 2, min-pass <= repeat)
- Scope exec error suppression to `repeat-min-pass` only (preserves backward compatibility for `fail-on-threshold` users)
- Use `ignoreReturnCode` instead of try/catch on exec
- Use a unique per-run output file path to prevent stale results
- Fail hard on ambiguous/partial grouping instead of warning
- Add a repeat summary to PR comments and workflow summaries
- Extract input parsing and threshold logic into utility modules
- Add 130 tests with 100% coverage on new utilities

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
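The strict parsing and cross-field validation described above could be sketched like this. The function names and error messages are assumptions for illustration, not the redesign's actual utility API:

```typescript
// Strict positive-integer parsing: accepts "3" or "12", but rejects
// "2.5", "3abc", "02" (leading zero), "", and whitespace. Returns null
// on any rejection so the caller can report a clear input error.
function parseStrictPositiveInt(raw: string): number | null {
  return /^[1-9][0-9]*$/.test(raw) ? Number(raw) : null;
}

// Cross-field validation as described in the redesign: repeating once is
// pointless, and the minimum pass count cannot exceed the run count.
// Returns an error message, or null when the combination is valid.
function validateRepeatInputs(repeat: number, minPass: number): string | null {
  if (repeat < 2) return "repeat must be >= 2 when repeat-min-pass is set";
  if (minPass > repeat) return "repeat-min-pass must be <= repeat";
  return null;
}
```

Rejecting "02" and "2.5" outright (instead of relying on `parseInt`, which would silently accept both as 2) is what makes the parsing strict: malformed inputs fail loudly rather than being coerced into a surprising run count.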
Thank you so much for putting this together, @tgvashworth. This was a really useful contribution: the original repeat support, threshold idea, docs, and tests gave us the right foundation for handling noisy LLM evals in the action. I really appreciate the thoughtful implementation and the concrete examples in the PR. I'm going to merge this first so your original work lands with proper credit, then I'll rebase the follow-up redesign/hardening PR on top of it.
This PR adds:

- A `repeat` input that runs each test N times via promptfoo's `--repeat` flag
- A `repeat-fail-on-threshold` input that checks each individual test passes a minimum percentage of its repeated runs

Example usage
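A workflow step combining the inputs might look like the following sketch. The input names `repeat`, `repeat-fail-on-threshold`, and `fail-on-threshold` come from this PR; the action reference and the `config` input name are illustrative assumptions, so check the action's README for the exact usage:

```yaml
- uses: promptfoo/promptfoo-action@v1   # illustrative ref; pin the real version
  with:
    config: promptfooconfig.yaml        # illustrative input name
    repeat: 3                           # run each test 3 times
    repeat-fail-on-threshold: 66        # each test must pass >= 66% of its runs (2 of 3)
    fail-on-threshold: 90               # suite-level: >= 90% overall pass rate
```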
This runs each test 3 times and requires each test to pass at least 2 of its 3 runs. The suite-level threshold additionally requires a 90% overall pass rate. Both checks are applied independently.
Testing
- Unit tests pass (`npm test`)
- Verified `--repeat 3` is passed to promptfoo and N×R test cases run
- `dist/` rebuilt and included