
Add comment-efficiency metrics to perturbation benchmark #84

Open
jingxuangu wants to merge 1 commit into ChicagoHAI:main from jingxuangu:add-comment-efficiency-metrics

Conversation

@jingxuangu

Summary

This PR adds budget-aware comment-efficiency metrics to the perturbation benchmark scoring pipeline.

The current benchmark primarily reports seeded-error recall. This is useful, but it does not distinguish concise reviewers from noisy reviewers that find the same number of injected errors by producing many more comments.

This PR preserves the existing detection semantics:

  • quote must match the perturbed text
  • explanation must match the perturbation's why_wrong

It records, for each perturbation, the index of the first comment that detects it, and adds:

  • n_detected_at_1, n_detected_at_3, n_detected_at_5, n_detected_at_10
  • recall_at_1, recall_at_3, recall_at_5, recall_at_10
  • comments_per_detected_error
  • detected_per_comment

These are comment-efficiency metrics, not true precision metrics, because unmatched comments may still identify real non-injected issues.
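
As a rough illustration of how the metrics listed above could be derived from first-detection indices, here is a minimal sketch. It is not the code in this PR: the function name `comment_efficiency`, the `first_hit` input, the 1-based indexing, and the exact formulas (for example, `comments_per_detected_error` as total comments divided by detected perturbations) are all assumptions made for the example.

```python
from typing import Dict, Optional, Sequence


def comment_efficiency(
    first_hit: Sequence[Optional[int]],
    n_comments: int,
    ks: Sequence[int] = (1, 3, 5, 10),
) -> Dict[str, Optional[float]]:
    """Hypothetical sketch of budget-aware metrics.

    first_hit[i] is the assumed 1-based index of the first comment that
    detects perturbation i, or None if no comment detects it.
    """
    n_total = len(first_hit)
    detected = [idx for idx in first_hit if idx is not None]
    metrics: Dict[str, Optional[float]] = {}

    for k in ks:
        # A perturbation counts at budget k if its first hit is within the first k comments.
        n_at_k = sum(1 for idx in detected if idx <= k)
        metrics[f"n_detected_at_{k}"] = n_at_k
        metrics[f"recall_at_{k}"] = n_at_k / n_total if n_total else 0.0

    n_detected = len(detected)
    # Assumed definitions: ratios between total comments and detected perturbations.
    metrics["comments_per_detected_error"] = (
        n_comments / n_detected if n_detected else None
    )
    metrics["detected_per_comment"] = n_detected / n_comments if n_comments else 0.0
    return metrics


# Example: 4 injected errors, 8 comments; one error is never detected.
example = comment_efficiency([1, 4, None, 2], n_comments=8)
```

Under these assumed definitions, every unmatched comment lowers `detected_per_comment`, whether or not that comment points at a real issue that was not injected.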

Testing

  • python -m pytest tests/test_perturbation_score.py -q
  • python -m py_compile benchmarks/perturbation/score.py benchmarks/perturbation/models.py benchmarks/perturbation/generate_report.py src/reviewer/cli.py tests/test_perturbation_score.py

Both passed locally.

Notes

I attempted a full local score-and-report smoke run, but the checked-in perturbation configs appear to use an older pipeline schema, and the repo does not include prepared or reviewed artifacts for the current unified runner. This appears to be unrelated to the scoring metric changes.

@dangng2004
Contributor

There are a few issues with this PR. First, the recall_at_k metric rewards systems that rank comments matching the injected errors first, not necessarily systems that are more efficient. For example, a system that emits 10 relevant and 10 irrelevant comments but puts the 10 relevant ones first gets exactly the same recall_at_k for every k from 1 to 10 as a system that emits only the 10 relevant comments. The second and more fundamental issue is that I think we cannot be sure the paper is otherwise error-free, so comment efficiency measured against only the injected errors isn't really useful.

That said, I think some notion of comment efficiency will be helpful. We just need better metrics for this.
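
The ordering objection can be reproduced with a small sketch. Everything here is hypothetical: the helper names, the 1-based comment indices, and the setup of 10 injected errors, each matched by exactly one comment. It is not taken from the PR's scoring code.

```python
from typing import List, Optional


def first_hit_indices(
    comment_hits: List[Optional[int]], n_perturbations: int
) -> List[Optional[int]]:
    """comment_hits[j] is the perturbation id matched by comment j, or None.

    Returns the 1-based index of the first matching comment per perturbation.
    """
    first: List[Optional[int]] = [None] * n_perturbations
    for idx, hit in enumerate(comment_hits, start=1):
        if hit is not None and first[hit] is None:
            first[hit] = idx
    return first


def recall_at_k(first_hit: List[Optional[int]], k: int) -> float:
    """Fraction of perturbations first detected within the first k comments."""
    return sum(1 for i in first_hit if i is not None and i <= k) / len(first_hit)


# Concise reviewer: 10 comments, each matching a distinct injected error.
concise: List[Optional[int]] = list(range(10))
# Noisy reviewer: the same 10 comments first, then 10 unmatched comments.
noisy: List[Optional[int]] = list(range(10)) + [None] * 10

concise_first = first_hit_indices(concise, 10)
noisy_first = first_hit_indices(noisy, 10)

# recall_at_k is identical for both reviewers at every budget up to 10.
identical = all(
    recall_at_k(concise_first, k) == recall_at_k(noisy_first, k)
    for k in range(1, 11)
)
```

In this setup the two reviewers cannot be told apart by recall_at_k alone, even though one emits twice as many comments as the other.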


2 participants