feat: add self-improving benchmark pipeline for OAPE tools by rausingh-rh · Pull Request #52 · openshift-eng/oape-ai-e2e

rausingh-rh · 2026-05-04T09:23:57Z

Summary

Adds a self-improving benchmark pipeline that measures and iteratively improves the quality of the OAPE code generation tools (api-generate, api-implement) by comparing their output against real, already-merged EP implementations.

How it works

Feedback loop: Generate code -> compare vs ground truth -> improve tool instructions -> repeat (default 3 iterations)
Bias-free: Repo cloned at pre-EP commit so the agent never sees the actual implementation
Generic improvements: Tool instruction edits are pattern-based, not EP-specific, so improvements benefit all future EPs
Multi-PR support: Combines related PRs into single ground truth for EPs implemented across multiple PRs
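
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in `benchmark/runner.py`: the `run_feedback_loop` function, the `IterationResult` type, and the three callables are hypothetical names, and the version labels simply follow the `original` / `improved-vN` naming used in the report.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IterationResult:
    iteration: int
    tool_version: str
    completeness: float  # percent, 0-100


def run_feedback_loop(
    generate: Callable[[str], str],       # hypothetical: tool instructions -> generated code
    compare: Callable[[str], float],      # hypothetical: generated code -> score vs ground truth
    improve: Callable[[str, float], str], # hypothetical: instructions + score -> new instructions
    instructions: str,
    iterations: int = 3,
) -> list[IterationResult]:
    """Generate -> compare vs ground truth -> improve instructions -> repeat."""
    results = []
    for i in range(1, iterations + 1):
        version = "original" if i == 1 else f"improved-v{i - 1}"
        generated = generate(instructions)
        score = compare(generated)
        results.append(IterationResult(i, version, score))
        # No improvement step after the final iteration: nothing would use it.
        if i < iterations:
            instructions = improve(instructions, score)
    return results
```

Passing the three steps in as callables keeps the sketch independent of how the agent is actually invoked.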

Metrics

The report uses actionable metrics instead of abstract scores:

| Metric | What it measures |
|--------|------------------|
| Completeness | What % of the human's structs, fields, and functions did the tool also generate? |
| Convention | What % of kubebuilder markers match between generated and ground truth? |
| Build | Did `make build` pass on the generated code? |
| Matched | Files the tool generated that also exist in the human implementation (true positives) |
| Missed | Files in the human implementation that the tool did NOT generate (gaps to close) |
| Wrong | Files the tool touched that it should not have -- unrelated to the EP (real errors) |
| Extras | Useful files the tool generated that the human didn't -- tests, samples, validation (tool outperformed human) |
| Auto | Files auto-generated by `make generate`/`make manifests` (expected build artifacts) |
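
As a rough illustration of how the five file counts relate to each other, the classification can be thought of as set operations plus pattern checks. The sketch below is an assumption about the general shape, not the logic in `benchmark/compare.py`; `classify_files`, `AUTO_PATTERNS`, and `EXTRA_PATTERNS` are hypothetical, and the patterns are placeholder examples only.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns, chosen only for illustration.
AUTO_PATTERNS = ("*zz_generated.*", "config/crd/*")
EXTRA_PATTERNS = ("*_test.go", "config/samples/*")


def _matches(path, patterns):
    return any(fnmatchcase(path, p) for p in patterns)


def classify_files(generated, truth):
    """Split file paths into matched / missed / auto / extras / wrong."""
    generated, truth = set(generated), set(truth)
    result = {
        "matched": sorted(generated & truth),  # in both
        "missed": sorted(truth - generated),   # human wrote it, tool did not
        "auto": [],
        "extras": [],
        "wrong": [],
    }
    # Files only the tool produced need a closer look.
    for path in sorted(generated - truth):
        if _matches(path, AUTO_PATTERNS):
            result["auto"].append(path)
        elif _matches(path, EXTRA_PATTERNS):
            result["extras"].append(path)
        else:
            result["wrong"].append(path)
    return result
```

Only files that are neither build artifacts nor recognizably useful additions end up in `wrong`, which is what keeps that count small and actionable.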

Example report table:

| Iter | Tool Version | Completeness | Convention | Build | Matched | Missed | Wrong | Extras | Auto |
|------|-------------|-------------|------------|-------|---------|--------|-------|--------|------|
| 1    | original    | 91.0%       | 97.8%      | PASS  | 6       | 12     | 1     | 5      | 0    |
| 2    | improved-v1 | 99.5%       | 97.8%      | PASS  | 8       | 10     | 1     | 5      | 0    |
| 3    | improved-v2 | 94.2%       | 97.8%      | PASS  | 9       | 9      | 1     | 6      | 0    |

How others can use it

cd benchmark/

# 1. Edit config.yaml with your EP-to-implementation mappings
# 2. Run the feedback loop
python benchmark.py run --config config.yaml

# 3. Review results in benchmark/results/<repo>/ep-<number>/report.md
# 4. Optionally push a good iteration as a PR
python benchmark.py push --ep 1863 --repo <fork-url> --iteration 3 --base-branch main

See benchmark/README.md for full documentation.

Tested on

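| EP | Repo | Best Completeness | Convention | Build | Wrong Files |
|----|------|-------------------|------------|-------|-------------|
| #1863 | zero-trust-workload-identity-manager | 97.2% | 99.6% | PASS | 1 |
| #1834 | external-secrets-operator | 99.5% | 97.8% | PASS | 1 |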

New files

| File | Purpose |
|------|---------|
| `benchmark/benchmark.py` | CLI entry point with `run`, `report`, `full`, `push` subcommands |
| `benchmark/runner.py` | Generation + improvement feedback loop via Claude Agent SDK |
| `benchmark/compare.py` | Delta-aware diff engine, scoring, file classification |
| `benchmark/isolate.py` | PR timeline resolution, bias-free environment setup |
| `benchmark/ground_truth.py` | Combined truth extraction from multiple PRs |
| `benchmark/report.py` | Markdown + JSON report generation |
| `benchmark/models.py` | Shared data models |
| `benchmark/go_ast_helper/` | Go AST extraction for struct-level comparison |
| `benchmark/config.yaml` | Example config |
| `benchmark/README.md` | Full usage documentation with metric definitions |

Test plan

Run benchmark on EP #1863 (ZTWIM) with PRs [68, 82] -- 3 iterations, all PASS
Run benchmark on EP #1834 (ESO) with PRs [67, 74] -- 3 iterations, all PASS
Verify bias prevention (no EP code at baseline commit)
Verify tool improvements are generic (not EP-specific)
Verify original tool files restored after benchmark
Verify make build passes on generated code
PR raised from generated output: feat: add SPIRE federation support (EP #1863) - OAPE benchmark improved-v2 rausingh-rh/zero-trust-workload-identity-manager#8

openshift-ci · 2026-05-04T09:24:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rausingh-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [rausingh-rh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Add a benchmark pipeline that measures and improves the quality of the OAPE code generation tools (api-generate, api-implement) using a feedback loop against real EP implementations.

The pipeline:
1. Takes curated EP-to-implementation mappings (EP URL + repo + PRs)
2. Clones the operator repo at the pre-EP commit (bias-free)
3. Runs the OAPE tools to generate code from the EP
4. Compares output against the real merged implementation
5. An improver agent analyzes gaps and edits the tool instructions
6. Repeats with the improved tool (default: 3 iterations)
7. Restores original tool files and reports score progression

Key features:
- Bias prevention: repo cloned before EP merge, agent never sees truth
- Multi-PR support: combines related PRs into single ground truth
- File classification: distinguishes auto-generated artifacts, formatting changes, valuable extras, and genuinely wrong files
- Adjusted precision: doesn't penalize the tool for generating tests or sample configs the human didn't write
- Generic improvements: tool edits are pattern-based, not EP-specific
- Optional PR push: promote any iteration's output to a real PR
- All agents use configurable model (default: claude-opus-4-6 max)

New files:
- benchmark/benchmark.py - CLI entry point and orchestrator
- benchmark/runner.py - Generation + improvement feedback loop
- benchmark/compare.py - Delta-aware diff, scoring, classification
- benchmark/isolate.py - PR timeline resolution, bias-free env
- benchmark/ground_truth.py - Combined truth extraction from PRs
- benchmark/report.py - Markdown + JSON report generation
- benchmark/models.py - Shared data models
- benchmark/go_ast_helper/ - Go AST extraction for struct comparison
- benchmark/config.yaml - Example config with EP #1863
- benchmark/README.md - Full usage documentation

Co-authored-by: Cursor <cursoragent@cursor.com>

Remove the misleading "precision" metric that penalized the tool for generating better code (extra tests, samples, validation). Replace with clear file classification counts in the report:
- Matched: files matching the human implementation
- Missed: files the tool didn't generate (gaps)
- Wrong: files that should not have been touched (real errors)
- Extras: useful files the human didn't write (tool outperformed)
- Auto: make-generated artifacts (expected behavior)

This gives a much clearer picture: "1 wrong file" is more actionable than "45% precision."

Co-authored-by: Cursor <cursoragent@cursor.com>

vendor/ files are dependency artifacts, not implementation code. e2e tests are handled by a separate OAPE command (/oape:e2e-generate), so they should not be part of the api-generate + api-implement evaluation.

Excluded paths: vendor/*, test/e2e/*, tests/e2e/*, e2e/*, *_e2e_test.go, *_e2e_suite_test.go

Co-authored-by: Cursor <cursoragent@cursor.com>
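
A small sketch of how such an exclusion list could be applied. The patterns are the ones listed in the commit message; matching them with Python's `fnmatch` is an assumption (in `fnmatch`, `*` also matches `/`, so `vendor/*` covers everything under `vendor/`), and `is_excluded` / `filter_paths` are hypothetical helper names.

```python
from fnmatch import fnmatchcase

# Patterns taken from the commit message above.
EXCLUDED_PATTERNS = (
    "vendor/*",
    "test/e2e/*",
    "tests/e2e/*",
    "e2e/*",
    "*_e2e_test.go",
    "*_e2e_suite_test.go",
)


def is_excluded(path: str) -> bool:
    """Return True if a repo-relative path should be left out of the evaluation."""
    return any(fnmatchcase(path, pattern) for pattern in EXCLUDED_PATTERNS)


def filter_paths(paths):
    """Keep only the paths that take part in the comparison."""
    return [p for p in paths if not is_excluded(p)]
```

Applying the same filter to both the generated output and the ground truth would keep the two sides comparable.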

…ases

Instead of improving the tool after every iteration of every EP (which causes overfitting and bloat), use a three-phase approach:
- Phase 1 (measure): Run all EPs once with the original tool
- Phase 2 (improve): Analyze ALL results together, find cross-EP patterns, make ONE set of concise improvements
- Phase 3 (verify): Run all EPs again with the improved tool

This produces targeted, general improvements that work across all EPs instead of EP-specific bloat.

CLI: benchmark.py measure|improve|verify|full

Co-authored-by: Cursor <cursoragent@cursor.com>
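
The three phases can be sketched as below. This is an illustrative outline under stated assumptions, not the actual orchestration code: `run_three_phase`, `run_ep`, and `improve_once` are hypothetical names, and the tool is represented as an opaque value.

```python
def run_three_phase(eps, run_ep, improve_once, original_tool):
    """Measure all EPs, improve the tool once, then verify all EPs again.

    run_ep(ep, tool)             -> result for one EP (hypothetical callable)
    improve_once(tool, results)  -> improved tool, given ALL baseline results
    """
    # Phase 1 (measure): every EP runs against the unmodified tool.
    baseline = {ep: run_ep(ep, original_tool) for ep in eps}

    # Phase 2 (improve): a single improvement pass that sees all results
    # together, so it can only act on cross-EP patterns.
    improved_tool = improve_once(original_tool, baseline)

    # Phase 3 (verify): every EP runs again against the improved tool.
    verified = {ep: run_ep(ep, improved_tool) for ep in eps}
    return baseline, improved_tool, verified
```

Calling the improver exactly once, after all measurements are in, is what distinguishes this from the per-iteration loop.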

- Remove unused _build_improvement_prompt, improve_tool, run_feedback_loop
- Fix mislabeled "missed files" line in improvement prompt
- Remove unused imports
- Add config.yaml with 9 EPs across 4 operators
- EP #1964 excluded (openshift/must-gather is Shell, not Go)

Co-authored-by: Cursor <cursoragent@cursor.com>

Previously, completeness was calculated only across matched files (files both the tool and human touched), ignoring missed files entirely. This gave inflated scores -- e.g., 97.2% when only 4 out of 31 Go files were matched.

Now, every missed Go file in the ground truth contributes a 0% score to the average. If the tool matches 4 files at 97% but misses 10 files, completeness = (4*97% + 10*0%) / 14 = 27.7%, not 97%.

This gives an honest picture of how much of the total implementation the tool actually covered.

Co-authored-by: Cursor <cursoragent@cursor.com>
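
The arithmetic in this commit message can be checked with a few lines. The `completeness` function below is a hypothetical stand-in for the real scoring code; it only reproduces the averaging rule described above.

```python
def completeness(matched_scores, missed_count):
    """Average per-file scores, counting every missed file as 0%.

    matched_scores: per-file completeness (percent) for files both sides touched
    missed_count:   number of ground-truth Go files the tool did not generate
    """
    total_files = len(matched_scores) + missed_count
    if total_files == 0:
        return 0.0
    return sum(matched_scores) / total_files


# Worked example from the commit message: 4 files at 97%, 10 files missed.
new_score = completeness([97.0] * 4, missed_count=10)  # (4*97 + 10*0) / 14
old_score = completeness([97.0] * 4, missed_count=0)   # missed files ignored
print(f"{new_score:.1f}% vs {old_score:.1f}%")         # 27.7% vs 97.0%
```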

openshift-ci Bot requested review from chiragkyal and neha037 May 4, 2026 09:24

openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 4, 2026

rausingh-rh force-pushed the bechmarking branch from 41d6681 to c868127 Compare May 8, 2026 11:10

rausingh-rh mentioned this pull request May 8, 2026

OAPE-687: update api-implement and api-generate after running benchmarking tool across few EPs #59

Merged

rausingh-rh force-pushed the bechmarking branch from c868127 to a86c8b6 Compare May 11, 2026 09:13

rausingh-rh and others added 6 commits June 10, 2026 16:08

rausingh-rh force-pushed the bechmarking branch from 8b2bd86 to 03e1cb3 Compare June 10, 2026 10:38

rausingh-rh wants to merge 6 commits into openshift-eng:main from rausingh-rh:bechmarking
