Add threshold sweep tool for perturbation scoring#90

Draft
dangng2004 wants to merge 1 commit into main from feat/threshold-sweep

@dangng2004
Contributor

Summary

  • threshold_sweep.py recomputes detection recall at a range of fuzzy-coverage thresholds by reusing already-completed reviewer outputs
  • LLM-judge decisions are threshold-independent, so they're cached on disk: one judgment per (perturbation, comment) pair, then sweeping is a pure re-aggregation
  • Pairs with the --threshold / --substring-gate flags added in PR #87 (Fix substring-match scorer bug; add gate/threshold flags) so the operating point can be chosen from data rather than guessed
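
A minimal sketch of the cache-then-re-aggregate idea described above, not the code in this PR. Everything here is an assumption made for illustration: the names `judge_cached`, `recall_at`, and `sweep`, the JSON cache layout, and in particular the detection rule (a perturbation counts as detected when at least one comment has fuzzy coverage at or above the threshold and the cached judge verdict for that pair is positive).

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (perturbation_id, comment_id); hypothetical identifiers


def cache_key(pair: Pair) -> str:
    # JSON object keys must be strings, so the pair is flattened into one key.
    return f"{pair[0]}::{pair[1]}"


def load_cache(path: Path) -> Dict[str, bool]:
    # An absent cache file simply means nothing has been judged yet.
    if path.exists():
        return json.loads(path.read_text())
    return {}


def judge_cached(
    pairs: Iterable[Pair],
    judge: Callable[[Pair], bool],
    cache_path: Path,
) -> Tuple[Dict[str, bool], int]:
    """Judge each pair at most once; return (verdicts, number of new judge calls)."""
    cache = load_cache(cache_path)
    calls = 0
    for pair in pairs:
        key = cache_key(pair)
        if key not in cache:
            cache[key] = bool(judge(pair))  # the only place the judge is invoked
            calls += 1
    cache_path.write_text(json.dumps(cache, indent=2, sort_keys=True))
    return cache, calls


def recall_at(
    threshold: float,
    coverage: Dict[Pair, float],
    verdicts: Dict[str, bool],
) -> float:
    # Assumed rule: detected if any comment passes the coverage gate AND the judge.
    # Perturbations with no comments at all never appear in `coverage`,
    # so this sketch does not count them in the denominator.
    perturbations = {p for p, _ in coverage}
    detected = {
        p
        for (p, c), score in coverage.items()
        if score >= threshold and verdicts[cache_key((p, c))]
    }
    return len(detected) / len(perturbations) if perturbations else 0.0


def sweep(
    thresholds: Iterable[float],
    coverage: Dict[Pair, float],
    verdicts: Dict[str, bool],
) -> List[Tuple[float, float]]:
    # Pure re-aggregation: no judge calls happen here.
    return [(t, recall_at(t, coverage, verdicts)) for t in thresholds]
```

Because the verdicts do not depend on the threshold, `sweep` can be rerun over any grid of thresholds without touching the judge again; only new (perturbation, comment) pairs would trigger a call.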

Test plan

  • Point at an existing results/<run>/<reviewer-model>/ tree and confirm it produces a CSV of recall-by-threshold
  • Confirm a second run uses the cache and skips LLM-judge calls
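
For the first check in the test plan, a rough sketch of what writing and sanity-checking a recall-by-threshold CSV could look like. The column names `threshold` and `recall`, the helper names, and the output formatting are all hypothetical; the actual CSV schema produced by threshold_sweep.py is not specified here. The non-increasing check only makes sense if raising the threshold can never add detections, which is an assumption about the scorer.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Tuple

Row = Tuple[float, float]  # (threshold, recall)


def write_recall_csv(rows: Iterable[Row], out_path: Path) -> None:
    # Hypothetical schema: one header row, then one row per threshold.
    with out_path.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["threshold", "recall"])
        for threshold, recall in rows:
            writer.writerow([f"{threshold:.2f}", f"{recall:.4f}"])


def read_recall_csv(path: Path) -> List[Row]:
    # Parse the CSV back into numeric pairs for inspection.
    with path.open(newline="") as fh:
        return [
            (float(rec["threshold"]), float(rec["recall"]))
            for rec in csv.DictReader(fh)
        ]


def is_non_increasing(rows: Iterable[Row]) -> bool:
    # Assumption: a stricter threshold can only remove detections,
    # so recall should never rise as the threshold increases.
    ordered = sorted(rows)
    return all(a[1] >= b[1] for a, b in zip(ordered, ordered[1:]))
```

A quick manual check would then be to read the produced file back and confirm there is one row per swept threshold and that recall does not rise with the threshold.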

🤖 Generated with Claude Code

threshold_sweep.py recomputes recall at a range of fuzzy-coverage thresholds
by reusing already-completed reviewer outputs and an on-disk cache of
LLM-judge decisions (which are threshold-independent). One judgment per
(perturbation, comment) pair; sweeping is then a pure re-aggregation.

Pairs with the --threshold / --substring-gate flags added in PR #87 to
support picking an operating point from data rather than guessing.
dangng2004 marked this pull request as draft May 21, 2026 20:58