feat: add repeat and repeat-fail-on-threshold inputs (#865)
Add two new inputs for handling non-deterministic LLM evals:

- `repeat`: runs each test N times via promptfoo's `--repeat` flag
- `repeat-fail-on-threshold`: per-test threshold requiring each individual test to pass a minimum percentage of its repeated runs

Example: `repeat=3` with `repeat-fail-on-threshold=66` means each test must pass at least 2 out of 3 runs. This filters out systematic failures while tolerating random grader variance.

Key design decisions:

- Per-test best-of-N, not global aggregate: results are grouped by test description (or vars as a fallback) and each test is checked independently against the threshold
- Both `fail-on-threshold` and `repeat-fail-on-threshold` run independently when both are set
- When thresholds are configured and pass, the action succeeds even if promptfoo exits non-zero (which it does whenever any test fails)
- Info logging shows the repeat config, threshold results, and clear explanations when exec errors are suppressed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
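The per-test best-of-N check described above can be sketched as follows. This is a minimal illustration, not the action's actual code: the `TestResult` shape, field names, and function name are assumptions, and promptfoo's real output schema may differ.

```typescript
// Hypothetical result shape; promptfoo's actual output schema may differ.
interface TestResult {
  description: string; // grouping key (vars would serve as the fallback)
  success: boolean;
}

// Group repeated runs by test, then require each test to pass at least
// `thresholdPct` percent of its own runs (per-test best-of-N, not a
// global aggregate across all runs).
function checkRepeatThreshold(
  results: TestResult[],
  thresholdPct: number,
): boolean {
  const groups = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const g = groups.get(r.description) ?? { passed: 0, total: 0 };
    g.total += 1;
    if (r.success) g.passed += 1;
    groups.set(r.description, g);
  }
  // Every test must independently meet the threshold.
  for (const g of groups.values()) {
    if ((g.passed / g.total) * 100 < thresholdPct) return false;
  }
  return true;
}
```

With `repeat=3` and a threshold of 66, a test passing 2 of its 3 runs scores 66.7% and meets the threshold, while a test passing only 1 of 3 (33.3%) fails it regardless of how other tests did.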
Redesign the repeat/threshold feature from PR #865 for better DX:

- Rename `repeat-fail-on-threshold` to `repeat-min-pass` (absolute count instead of percentage: "2 of 3" rather than "66%")
- Change the `repeat` default from '1' to '' (omitted = absent)
- Add strict input parsing that rejects "2.5", "3abc", "02"
- Add cross-field validation (repeat >= 2, min-pass <= repeat)
- Scope exec error suppression to `repeat-min-pass` only (preserves backward compatibility for `fail-on-threshold` users)
- Use `ignoreReturnCode` instead of try/catch on exec
- Use a unique per-run output file path to prevent stale results
- Fail hard on ambiguous/partial grouping instead of warning
- Add a repeat summary to PR comments and workflow summaries
- Extract input parsing and threshold logic into utility modules
- Add 130 tests with 100% coverage on new utilities

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
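The strict parsing and cross-field validation described above could be sketched like this. The function names and error messages are assumptions for illustration, not the redesign's actual utility API:

```typescript
// Strict positive-integer parsing: accepts "3" or "12", but rejects
// "2.5", "3abc", "02" (leading zero), "", and whitespace. Returns null
// on any rejection so the caller can report a clear input error.
function parseStrictPositiveInt(raw: string): number | null {
  return /^[1-9][0-9]*$/.test(raw) ? Number(raw) : null;
}

// Cross-field validation as described in the redesign: repeating once is
// pointless, and the minimum pass count cannot exceed the run count.
// Returns an error message, or null when the combination is valid.
function validateRepeatInputs(repeat: number, minPass: number): string | null {
  if (repeat < 2) return "repeat must be >= 2 when repeat-min-pass is set";
  if (minPass > repeat) return "repeat-min-pass must be <= repeat";
  return null;
}
```

Rejecting "02" and "2.5" outright (instead of relying on `parseInt`, which would silently accept both as 2) is what makes the parsing strict: malformed inputs fail loudly rather than being coerced into a surprising run count.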
Thank you so much for putting this together, @tgvashworth. This was a really useful contribution: the original repeat support, threshold idea, docs, and tests gave us the right foundation for handling noisy LLM evals in the action. I really appreciate the thoughtful implementation and the concrete examples in the PR. I'm going to merge this first so your original work lands with proper credit, then I'll rebase the follow-up redesign/hardening PR on top of it.
This PR adds:

- A `repeat` input that runs each test N times via promptfoo's `--repeat` flag
- A `repeat-fail-on-threshold` input that checks each individual test passes a minimum percentage of its repeated runs

Example usage
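A workflow step combining the inputs might look like the following sketch. The input names `repeat`, `repeat-fail-on-threshold`, and `fail-on-threshold` come from this PR; the action reference and the `config` input name are illustrative assumptions, so check the action's README for the exact usage:

```yaml
- uses: promptfoo/promptfoo-action@v1   # illustrative ref; pin the real version
  with:
    config: promptfooconfig.yaml        # illustrative input name
    repeat: 3                           # run each test 3 times
    repeat-fail-on-threshold: 66        # each test must pass >= 66% of its runs (2 of 3)
    fail-on-threshold: 90               # suite-level: >= 90% overall pass rate
```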
This runs each test 3 times and requires each test to pass at least 2 of its 3 runs. The suite-level threshold additionally requires a 90% overall pass rate. Both checks are applied independently.
Testing
- Unit tests pass (`npm test`)
- Verified `--repeat 3` is passed to promptfoo and N×R test cases run
- `dist/` rebuilt and included