feat: add self-improving benchmark pipeline for OAPE tools #52
Open
rausingh-rh wants to merge 6 commits into
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rausingh-rh
Add a benchmark pipeline that measures and improves the quality of the OAPE code generation tools (api-generate, api-implement) using a feedback loop against real EP implementations.

The pipeline:
1. Takes curated EP-to-implementation mappings (EP URL + repo + PRs)
2. Clones the operator repo at the pre-EP commit (bias-free)
3. Runs the OAPE tools to generate code from the EP
4. Compares output against the real merged implementation
5. An improver agent analyzes gaps and edits the tool instructions
6. Repeats with the improved tool (default: 3 iterations)
7. Restores original tool files and reports score progression

Key features:
- Bias prevention: repo cloned before EP merge, agent never sees truth
- Multi-PR support: combines related PRs into single ground truth
- File classification: distinguishes auto-generated artifacts, formatting changes, valuable extras, and genuinely wrong files
- Adjusted precision: doesn't penalize the tool for generating tests or sample configs the human didn't write
- Generic improvements: tool edits are pattern-based, not EP-specific
- Optional PR push: promote any iteration's output to a real PR
- All agents use configurable model (default: claude-opus-4-6 max)

New files:
- benchmark/benchmark.py - CLI entry point and orchestrator
- benchmark/runner.py - Generation + improvement feedback loop
- benchmark/compare.py - Delta-aware diff, scoring, classification
- benchmark/isolate.py - PR timeline resolution, bias-free env
- benchmark/ground_truth.py - Combined truth extraction from PRs
- benchmark/report.py - Markdown + JSON report generation
- benchmark/models.py - Shared data models
- benchmark/go_ast_helper/ - Go AST extraction for struct comparison
- benchmark/config.yaml - Example config with EP #1863
- benchmark/README.md - Full usage documentation

Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the misleading "precision" metric that penalized the tool for generating better code (extra tests, samples, validation). Replace with clear file classification counts in the report:
- Matched: files matching the human implementation
- Missed: files the tool didn't generate (gaps)
- Wrong: files that should not have been touched (real errors)
- Extras: useful files the human didn't write (tool outperformed)
- Auto: make-generated artifacts (expected behavior)

This gives a much clearer picture: "1 wrong file" is more actionable than "45% precision."

Co-authored-by: Cursor <cursoragent@cursor.com>
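As a rough illustration of the classification counts described in this commit, the sketch below tallies per-file labels into the five report buckets. The `FileClass` enum, the `summarize` function, and the file paths are hypothetical names chosen for this example; they are not taken from benchmark/compare.py or benchmark/report.py.

```python
from collections import Counter
from enum import Enum


class FileClass(Enum):
    """The five report buckets named in the commit message (hypothetical type)."""
    MATCHED = "matched"
    MISSED = "missed"
    WRONG = "wrong"
    EXTRA = "extra"
    AUTO = "auto"


def summarize(classified: dict[str, FileClass]) -> dict[str, int]:
    """Count files per bucket, always reporting every bucket (zero if empty)."""
    counts = Counter(label.value for label in classified.values())
    return {label.value: counts.get(label.value, 0) for label in FileClass}


# Hypothetical classification result for one EP run.
example = {
    "api/v1/types.go": FileClass.MATCHED,
    "pkg/controller/reconcile.go": FileClass.MISSED,
    "api/v1/zz_generated.deepcopy.go": FileClass.AUTO,
    "config/samples/sample.yaml": FileClass.EXTRA,
    "README.md": FileClass.WRONG,
}
print(summarize(example))
```

Reporting every bucket, including empty ones, keeps the report columns stable across EPs.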
vendor/ files are dependency artifacts, not implementation code. e2e tests are handled by a separate OAPE command (/oape:e2e-generate), so they should not be part of the api-generate + api-implement evaluation.

Excluded paths: vendor/*, test/e2e/*, tests/e2e/*, e2e/*, *_e2e_test.go, *_e2e_suite_test.go

Co-authored-by: Cursor <cursoragent@cursor.com>
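One way the exclusion list above could be applied is with shell-style pattern matching. This is a minimal sketch, assuming the patterns are matched against repo-relative paths with `fnmatch` semantics (where `*` also matches `/`); the real matching logic in the benchmark may differ, and `is_excluded` and `filter_paths` are hypothetical helper names.

```python
from fnmatch import fnmatchcase

# Patterns copied from the commit message above.
EXCLUDED_PATTERNS = (
    "vendor/*",
    "test/e2e/*",
    "tests/e2e/*",
    "e2e/*",
    "*_e2e_test.go",
    "*_e2e_suite_test.go",
)


def is_excluded(path: str) -> bool:
    """Return True if a repo-relative path matches any excluded pattern."""
    return any(fnmatchcase(path, pattern) for pattern in EXCLUDED_PATTERNS)


def filter_paths(paths: list[str]) -> list[str]:
    """Keep only the paths that take part in the evaluation."""
    return [p for p in paths if not is_excluded(p)]


# Hypothetical changed-file list.
changed = [
    "vendor/github.com/example/lib/lib.go",
    "test/e2e/operator_test.go",
    "pkg/controller/foo_e2e_test.go",
    "api/v1/types.go",
]
print(filter_paths(changed))
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case-sensitive regardless of the host operating system.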
…ases

Instead of improving the tool after every iteration of every EP (which causes overfitting and bloat), use a three-phase approach:
- Phase 1 (measure): Run all EPs once with the original tool
- Phase 2 (improve): Analyze ALL results together, find cross-EP patterns, make ONE set of concise improvements
- Phase 3 (verify): Run all EPs again with the improved tool

This produces targeted, general improvements that work across all EPs instead of EP-specific bloat.

CLI: benchmark.py measure|improve|verify|full

Co-authored-by: Cursor <cursoragent@cursor.com>
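The three-phase flow can be sketched as a small orchestration function. This is only an outline under stated assumptions: `run_full`, `run_ep`, and `improve` are hypothetical names, the tool is represented by an opaque string, and each EP run is reduced to a single float score, which is far simpler than what the pipeline actually records.

```python
from typing import Callable, Sequence

Scores = dict[str, float]


def run_full(
    eps: Sequence[str],
    tool: str,
    run_ep: Callable[[str, str], float],
    improve: Callable[[str, Scores], str],
) -> tuple[Scores, Scores]:
    """Measure all EPs, improve the tool once, then verify all EPs again."""
    # Phase 1 (measure): every EP runs against the original tool.
    baseline = {ep: run_ep(ep, tool) for ep in eps}
    # Phase 2 (improve): one improvement pass that sees ALL results together.
    improved_tool = improve(tool, baseline)
    # Phase 3 (verify): every EP runs again against the improved tool.
    verified = {ep: run_ep(ep, improved_tool) for ep in eps}
    return baseline, verified


# Stub callables standing in for real generation and improvement agents.
def fake_run(ep: str, tool: str) -> float:
    return 1.0 if tool == "tool-v2" else 0.5


def fake_improve(tool: str, results: Scores) -> str:
    return "tool-v2"


before, after = run_full(["ep-a", "ep-b"], "tool-v1", fake_run, fake_improve)
print(before, after)
```

Calling `improve` exactly once, with the results of every EP, is what distinguishes this from a per-EP feedback loop.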
- Remove unused _build_improvement_prompt, improve_tool, run_feedback_loop
- Fix mislabeled "missed files" line in improvement prompt
- Remove unused imports
- Add config.yaml with 9 EPs across 4 operators
- EP #1964 excluded (openshift/must-gather is Shell, not Go)

Co-authored-by: Cursor <cursoragent@cursor.com>
Previously, completeness was calculated only across matched files (files both the tool and human touched), ignoring missed files entirely. This gave inflated scores -- e.g., 97.2% when only 4 out of 31 Go files were matched.

Now, every missed Go file in the ground truth contributes a 0% score to the average. If the tool matches 4 files at 97% but misses 10 files, completeness = (4*97% + 10*0%) / 14 = 27.7%, not 97%.

This gives an honest picture of how much of the total implementation the tool actually covered.

Co-authored-by: Cursor <cursoragent@cursor.com>
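The arithmetic in this commit can be checked with a short sketch. The `completeness` function below is a hypothetical helper, not the code in benchmark/compare.py; returning 0.0 when there are no ground-truth files is an assumption made here for the example.

```python
def completeness(matched_scores: list[float], missed_count: int) -> float:
    """Average per-file score where each missed file counts as 0.0."""
    total_files = len(matched_scores) + missed_count
    if total_files == 0:
        return 0.0  # assumption: no ground-truth Go files means nothing covered
    # Missed files add nothing to the numerator but still enlarge the denominator.
    return sum(matched_scores) / total_files


# The example from the commit message: 4 files matched at 97%, 10 files missed.
matched = [0.97, 0.97, 0.97, 0.97]
old_style = sum(matched) / len(matched)   # averages matched files only
new_style = completeness(matched, 10)     # (4 * 0.97 + 10 * 0) / 14
print(f"old: {old_style:.1%}, new: {new_style:.1%}")
```

With these inputs the matched-only average stays at 97%, while the corrected average drops to roughly 27.7%, matching the figures in the commit message.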
Summary
Adds a self-improving benchmark pipeline that measures and iteratively improves the quality of the OAPE code generation tools (api-generate, api-implement) by comparing their output against real, already-merged EP implementations.
How it works
Metrics
The report uses actionable metrics instead of abstract scores:
make build pass on the generated code?
make generate / make manifests (expected build artifacts)
Example report table:
How others can use it
See benchmark/README.md for full documentation.
Tested on
New files
- benchmark/benchmark.py: run, report, full, push subcommands
- benchmark/runner.py
- benchmark/compare.py
- benchmark/isolate.py
- benchmark/ground_truth.py
- benchmark/report.py
- benchmark/models.py
- benchmark/go_ast_helper/
- benchmark/config.yaml
- benchmark/README.md
Test plan
make build passes on generated code