Add comment-efficiency metrics to perturbation benchmark by jingxuangu · Pull Request #84 · ChicagoHAI/OpenAIReview

jingxuangu · 2026-05-15T01:47:11Z

Summary

This PR adds budget-aware comment-efficiency metrics to the perturbation benchmark scoring pipeline.

The current benchmark primarily reports seeded-error recall. This is useful, but it does not distinguish concise reviewers from noisy reviewers that find the same number of injected errors by producing many more comments.

This PR preserves the existing detection semantics:

quote must match the perturbed text
explanation must match the perturbation's why_wrong

It records the first comment index that detects each perturbation and adds:

n_detected_at_1, n_detected_at_3, n_detected_at_5, n_detected_at_10
recall_at_1, recall_at_3, recall_at_5, recall_at_10
comments_per_detected_error
detected_per_comment

These are comment-efficiency metrics, not true precision metrics, because unmatched comments may still identify real non-injected issues.
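The metrics listed above can be sketched from the recorded first-detection indices. This is a minimal illustration, not the code in benchmarks/perturbation/score.py: the function name `efficiency_metrics`, the `first_hit` mapping, the 0-based comment index, and the choice of total comment count as the denominator are all assumptions made for the example.

```python
from typing import Dict, Optional, Sequence


def efficiency_metrics(
    first_hit: Dict[str, Optional[int]],
    n_comments: int,
    ks: Sequence[int] = (1, 3, 5, 10),
) -> Dict[str, Optional[float]]:
    """Compute budget-aware metrics from first-detection indices.

    first_hit maps a (hypothetical) perturbation id to the 0-based index of
    the first comment that detected it, or None if it was never detected.
    """
    n_perturbations = len(first_hit)
    hits = [i for i in first_hit.values() if i is not None]
    n_detected = len(hits)

    out: Dict[str, Optional[float]] = {}
    for k in ks:
        # A perturbation counts at budget k if one of the first k comments caught it.
        n_at_k = sum(1 for i in hits if i < k)
        out[f"n_detected_at_{k}"] = n_at_k
        out[f"recall_at_{k}"] = n_at_k / n_perturbations if n_perturbations else None

    # Ratio metrics are undefined (None) when their denominator is zero.
    out["comments_per_detected_error"] = n_comments / n_detected if n_detected else None
    out["detected_per_comment"] = n_detected / n_comments if n_comments else None
    return out


example = efficiency_metrics({"p1": 0, "p2": 4, "p3": None, "p4": 11}, n_comments=12)
```

In this example three of four perturbations are detected across 12 comments, but only two of them fall within the first five comments, so recall_at_5 is 0.5 while comments_per_detected_error is 4.0.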

Testing

python -m pytest tests/test_perturbation_score.py -q
python -m py_compile benchmarks/perturbation/score.py benchmarks/perturbation/models.py benchmarks/perturbation/generate_report.py src/reviewer/cli.py tests/test_perturbation_score.py

Both passed locally.

Notes

I attempted a full local score-and-report smoke run, but the checked-in perturbation configs appear to use an older pipeline schema, and the repo does not include prepared/reviewed artifacts for the current unified runner. This appears unrelated to the scoring metric changes.

dangng2004 · 2026-05-21T23:02:25Z

There are a few issues with this PR. First, the recall_at_k metric rewards systems that prioritize the comments on injected errors, not necessarily ones that are more efficient. For example, a system that emits 10 relevant and 10 irrelevant comments but puts the 10 relevant ones first will get exactly the same recall_at_k for every k from 1 to 10 as another system that emits only the 10 relevant comments. The second and more fundamental issue is that we cannot be sure the paper is otherwise error-free, so comment efficiency measured against just the injected errors isn't really useful.

That said, I think some notion of comment efficiency will be helpful. We just need better metrics for this.
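The first objection can be checked with a small sketch. It assumes 10 injected errors, that each relevant comment detects a distinct one, and a 0-based first-detection index. The helper names below are hypothetical and are not taken from the PR.

```python
from typing import Dict, List, Sequence


def recall_at_k(first_hit: List[int], n_perturbations: int, k: int) -> float:
    """Fraction of perturbations first detected within the first k comments."""
    return sum(1 for i in first_hit if i < k) / n_perturbations


def compare(ks: Sequence[int] = (1, 3, 5, 10)) -> Dict[str, object]:
    n_perturbations = 10  # assumed number of injected errors

    # System A: 10 relevant comments first, followed by 10 irrelevant ones.
    a_first_hit, a_n_comments = list(range(10)), 20
    # System B: only the 10 relevant comments.
    b_first_hit, b_n_comments = list(range(10)), 10

    return {
        "a_recall": [recall_at_k(a_first_hit, n_perturbations, k) for k in ks],
        "b_recall": [recall_at_k(b_first_hit, n_perturbations, k) for k in ks],
        # Detected errors divided by total comments emitted.
        "a_detected_per_comment": len(a_first_hit) / a_n_comments,
        "b_detected_per_comment": len(b_first_hit) / b_n_comments,
    }


result = compare()
```

Under these assumptions the two recall lists are identical at every k, so recall_at_k alone cannot separate the noisy system from the concise one. In this sketch only the ratio over total comments differs (0.5 versus 1.0), and that ratio is still subject to the second objection about non-injected errors.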

Add budget-aware metrics to perturbation scoring

2d2adc6


jingxuangu wants to merge 1 commit into ChicagoHAI:main from jingxuangu:add-comment-efficiency-metrics
