
feat: add self-improving benchmark pipeline for OAPE tools#52

Open
rausingh-rh wants to merge 6 commits into openshift-eng:main from rausingh-rh:bechmarking

rausingh-rh commented May 4, 2026

Contributor

Summary

Adds a self-improving benchmark pipeline that measures and iteratively improves the quality of the OAPE code generation tools (api-generate, api-implement) by comparing their output against real, already-merged EP implementations.

How it works

  1. Feedback loop: Generate code -> compare vs ground truth -> improve tool instructions -> repeat (default 3 iterations)
  2. Bias-free: Repo cloned at pre-EP commit so the agent never sees the actual implementation
  3. Generic improvements: Tool instruction edits are pattern-based, not EP-specific, so improvements benefit all future EPs
  4. Multi-PR support: Combines related PRs into a single ground truth for EPs implemented across multiple PRs
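
As a rough illustration of the bias-free setup in step 2, the sketch below builds the git commands that leave a clone at the commit just before the EP work landed. The function name and its parameters are hypothetical and are not taken from `benchmark/isolate.py`; it also assumes the first EP commit is known and that its first parent is the desired pre-EP state.

```python
from typing import List


def pre_ep_checkout_commands(repo_url: str, first_ep_commit: str, dest: str) -> List[List[str]]:
    """Return git commands that leave `dest` at the parent of the first EP commit.

    Hypothetical helper: the real pipeline may resolve the pre-EP commit differently.
    """
    return [
        ["git", "clone", repo_url, dest],
        # `<sha>^` names the first parent, i.e. the tree before the EP change landed.
        ["git", "-C", dest, "checkout", "--detach", f"{first_ep_commit}^"],
    ]
```

Each returned command could be passed to `subprocess.run(cmd, check=True)`. Returning the commands instead of running them keeps the sketch easy to inspect without network access.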

Metrics

The report uses actionable metrics instead of abstract scores:

| Metric | What it measures |
|--------|------------------|
| Completeness | What % of the human's structs, fields, and functions did the tool also generate? |
| Convention | What % of kubebuilder markers match between generated and ground truth? |
| Build | Did `make build` pass on the generated code? |
| Matched | Files the tool generated that also exist in the human implementation (true positives) |
| Missed | Files in the human implementation that the tool did NOT generate (gaps to close) |
| Wrong | Files the tool touched that it should not have -- unrelated to the EP (real errors) |
| Extras | Useful files the tool generated that the human didn't -- tests, samples, validation (tool outperformed human) |
| Auto | Files auto-generated by `make generate`/`make manifests` (expected build artifacts) |
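
The file-level metrics above can be pictured as set operations plus a few path heuristics. The following is a minimal sketch, not the logic in `benchmark/compare.py`: the function name and both pattern lists are assumptions chosen for illustration.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

# Hypothetical patterns, for illustration only; the PR's real rules may differ.
AUTO_PATTERNS = ("*zz_generated*.go", "config/crd/bases/*")
EXTRA_PATTERNS = ("*_test.go", "config/samples/*")


def classify_files(generated: Iterable[str], truth: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket file paths into matched / missed / auto / extras / wrong."""
    gen, ref = set(generated), set(truth)
    buckets: Dict[str, List[str]] = {
        "matched": sorted(gen & ref),  # generated and present in the ground truth
        "missed": sorted(ref - gen),   # in the ground truth but not generated
        "auto": [],
        "extras": [],
        "wrong": [],
    }
    # Files only the tool produced need a heuristic to tell the cases apart.
    for path in sorted(gen - ref):
        if any(fnmatchcase(path, p) for p in AUTO_PATTERNS):
            buckets["auto"].append(path)
        elif any(fnmatchcase(path, p) for p in EXTRA_PATTERNS):
            buckets["extras"].append(path)
        else:
            buckets["wrong"].append(path)
    return buckets
```

Anything the tool generated that matches neither pattern list falls through to "wrong", so the heuristics err on the side of flagging files for review.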

Example report table:

| Iter | Tool Version | Completeness | Convention | Build | Matched | Missed | Wrong | Extras | Auto |
|------|-------------|-------------|------------|-------|---------|--------|-------|--------|------|
| 1    | original    | 91.0%       | 97.8%      | PASS  | 6       | 12     | 1     | 5      | 0    |
| 2    | improved-v1 | 99.5%       | 97.8%      | PASS  | 8       | 10     | 1     | 5      | 0    |
| 3    | improved-v2 | 94.2%       | 97.8%      | PASS  | 9       | 9      | 1     | 6      | 0    |

How others can use it

cd benchmark/

# 1. Edit config.yaml with your EP-to-implementation mappings
# 2. Run the feedback loop
python benchmark.py run --config config.yaml

# 3. Review results in benchmark/results/<repo>/ep-<number>/report.md
# 4. Optionally push a good iteration as a PR
python benchmark.py push --ep 1863 --repo <fork-url> --iteration 3 --base-branch main

See benchmark/README.md for full documentation.

Tested on

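| EP | Repo | Best Completeness | Convention | Build | Wrong Files |
|----|------|-------------------|------------|-------|-------------|
| #1863 | zero-trust-workload-identity-manager | 97.2% | 99.6% | PASS | 1 |
| #1834 | external-secrets-operator | 99.5% | 97.8% | PASS | 1 |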

New files

| File | Purpose |
|------|---------|
| benchmark/benchmark.py | CLI entry point with run, report, full, push subcommands |
| benchmark/runner.py | Generation + improvement feedback loop via Claude Agent SDK |
| benchmark/compare.py | Delta-aware diff engine, scoring, file classification |
| benchmark/isolate.py | PR timeline resolution, bias-free environment setup |
| benchmark/ground_truth.py | Combined truth extraction from multiple PRs |
| benchmark/report.py | Markdown + JSON report generation |
| benchmark/models.py | Shared data models |
| benchmark/go_ast_helper/ | Go AST extraction for struct-level comparison |
| benchmark/config.yaml | Example config |
| benchmark/README.md | Full usage documentation with metric definitions |

Test plan

openshift-ci Bot requested review from chiragkyal and neha037 May 4, 2026 09:24
openshift-ci Bot commented May 4, 2026


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rausingh-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) May 4, 2026
rausingh-rh and others added 6 commits June 10, 2026 16:08
Add a benchmark pipeline that measures and improves the quality of
the OAPE code generation tools (api-generate, api-implement) using
a feedback loop against real EP implementations.

The pipeline:
1. Takes curated EP-to-implementation mappings (EP URL + repo + PRs)
2. Clones the operator repo at the pre-EP commit (bias-free)
3. Runs the OAPE tools to generate code from the EP
4. Compares output against the real merged implementation
5. An improver agent analyzes gaps and edits the tool instructions
6. Repeats with the improved tool (default: 3 iterations)
7. Restores original tool files and reports score progression
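
Steps 5 to 7 imply that the tool instructions are edited in place during the loop and put back afterwards. One way to guarantee the restore, sketched under assumptions: `run_with_restore` and the `run_iteration` callback are hypothetical names, and the callback stands in for one generate, compare, and improve round.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


def run_with_restore(tool_dir: Path, iterations: int,
                     run_iteration: Callable[[int], float]) -> List[float]:
    """Run `iterations` rounds, then restore `tool_dir` to its original contents."""
    scores: List[float] = []
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / "tool_backup"
        shutil.copytree(tool_dir, backup)  # snapshot the original tool files
        try:
            for i in range(1, iterations + 1):
                # The callback may edit files under tool_dir (the improver step).
                scores.append(run_iteration(i))
        finally:
            # Restore even if an iteration raised, so the repo is left clean.
            shutil.rmtree(tool_dir)
            shutil.copytree(backup, tool_dir)
    return scores
```

The `try`/`finally` means a crashed iteration still leaves the original tool files in place, at the cost of one full directory copy per run.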

Key features:
- Bias prevention: repo cloned before EP merge, agent never sees truth
- Multi-PR support: combines related PRs into single ground truth
- File classification: distinguishes auto-generated artifacts,
  formatting changes, valuable extras, and genuinely wrong files
- Adjusted precision: doesn't penalize the tool for generating
  tests or sample configs the human didn't write
- Generic improvements: tool edits are pattern-based, not EP-specific
- Optional PR push: promote any iteration's output to a real PR
- All agents use configurable model (default: claude-opus-4-6 max)

New files:
- benchmark/benchmark.py    - CLI entry point and orchestrator
- benchmark/runner.py       - Generation + improvement feedback loop
- benchmark/compare.py      - Delta-aware diff, scoring, classification
- benchmark/isolate.py      - PR timeline resolution, bias-free env
- benchmark/ground_truth.py - Combined truth extraction from PRs
- benchmark/report.py       - Markdown + JSON report generation
- benchmark/models.py       - Shared data models
- benchmark/go_ast_helper/  - Go AST extraction for struct comparison
- benchmark/config.yaml     - Example config with EP #1863
- benchmark/README.md       - Full usage documentation

Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the misleading "precision" metric that penalized the tool for
generating better code (extra tests, samples, validation). Replace with
clear file classification counts in the report:

- Matched: files matching the human implementation
- Missed: files the tool didn't generate (gaps)
- Wrong: files that should not have been touched (real errors)
- Extras: useful files the human didn't write (tool outperformed)
- Auto: make-generated artifacts (expected behavior)

This gives a much clearer picture: "1 wrong file" is more actionable
than "45% precision."

Co-authored-by: Cursor <cursoragent@cursor.com>
vendor/ files are dependency artifacts, not implementation code.
e2e tests are handled by a separate OAPE command (/oape:e2e-generate),
so they should not be part of the api-generate + api-implement evaluation.

Excluded paths: vendor/*, test/e2e/*, tests/e2e/*, e2e/*,
*_e2e_test.go, *_e2e_suite_test.go
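
A minimal sketch of how such an exclusion list could be applied, assuming shell-style matching via Python's `fnmatch` (the commit does not say which matcher is used). The patterns are the ones listed above; `EXCLUDED_PATTERNS` and `filter_evaluated` are hypothetical names.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

EXCLUDED_PATTERNS = (
    "vendor/*", "test/e2e/*", "tests/e2e/*", "e2e/*",
    "*_e2e_test.go", "*_e2e_suite_test.go",
)


def filter_evaluated(paths: Iterable[str]) -> List[str]:
    """Drop paths that match any excluded pattern, keeping input order."""
    # In fnmatch, `*` also matches `/`, so "vendor/*" covers nested directories.
    return [
        p for p in paths
        if not any(fnmatchcase(p, pat) for pat in EXCLUDED_PATTERNS)
    ]
```

Under this matcher the directory patterns only match at the repository root, so a nested path such as `pkg/vendor/x.go` would still be evaluated.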

Co-authored-by: Cursor <cursoragent@cursor.com>
…ases

Instead of improving the tool after every iteration of every EP (which
causes overfitting and bloat), use a three-phase approach:

Phase 1 (measure): Run all EPs once with the original tool
Phase 2 (improve): Analyze ALL results together, find cross-EP patterns,
  make ONE set of concise improvements
Phase 3 (verify): Run all EPs again with the improved tool

This produces targeted, general improvements that work across all EPs
instead of EP-specific bloat.

CLI: benchmark.py measure|improve|verify|full
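
The three phases can be sketched as a single function. This is an illustrative outline only: `three_phase`, `run_ep`, and `improve_once` are hypothetical names, and the two callbacks stand in for the real generation and improver agents.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]  # EP identifier -> score for that EP


def three_phase(eps: List[str],
                run_ep: Callable[[str], float],
                improve_once: Callable[[Scores], None]) -> Dict[str, Scores]:
    """Measure all EPs, apply one improvement pass, then measure again."""
    before = {ep: run_ep(ep) for ep in eps}  # Phase 1: measure with the original tool
    improve_once(before)                     # Phase 2: one pass over ALL results
    after = {ep: run_ep(ep) for ep in eps}   # Phase 3: verify with the improved tool
    return {"before": before, "after": after}
```

Because the improver sees every EP's result at once and runs a single time, it cannot tune the tool to one EP between runs, which is the overfitting this commit is meant to avoid.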
Co-authored-by: Cursor <cursoragent@cursor.com>
- Remove unused _build_improvement_prompt, improve_tool, run_feedback_loop
- Fix mislabeled "missed files" line in improvement prompt
- Remove unused imports
- Add config.yaml with 9 EPs across 4 operators
- EP #1964 excluded (openshift/must-gather is Shell, not Go)

Co-authored-by: Cursor <cursoragent@cursor.com>
Previously, completeness was calculated only across matched files
(files both the tool and human touched), ignoring missed files entirely.
This gave inflated scores -- e.g., 97.2% when only 4 out of 31 Go files
were matched.

Now, every missed Go file in the ground truth contributes a 0% score to
the average. If the tool matches 4 files at 97% but misses 10 files,
completeness = (4*97% + 10*0%) / 14 = 27.7%, not 97%.

This gives an honest picture of how much of the total implementation
the tool actually covered.
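
The arithmetic above, written out as a small function. The name `completeness` and its signature are assumptions for illustration, as is returning 0.0 when there are no files at all.

```python
from typing import Sequence


def completeness(matched_scores: Sequence[float], missed_count: int) -> float:
    """Average per-file scores (in percent), counting each missed file as 0%."""
    total = len(matched_scores) + missed_count
    if total == 0:
        return 0.0  # assumption: nothing to cover means nothing covered
    return sum(matched_scores) / total


# The example from the commit message: 4 files at 97%, 10 files missed.
example = round(completeness([97.0] * 4, 10), 1)  # 27.7
```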

Co-authored-by: Cursor <cursoragent@cursor.com>
