Commit c85ae1b

sjarmak and claude committed
feat: oracle cross-validation agent with multi-backend retrieval
Add three scripts for cross-validating oracle ground truth:

- context_retrieval_agent.py: Claude-powered retrieval agent with 3 backends (local/deepsearch/hybrid). Uses configurable model (default: Sonnet 4.6). Local tools: grep, rg, find, file reading, import search, symbol search. SG tools: deep search, keyword search. Cross-validate mode runs all backends and reports agreement metrics.
- validate_on_contextbench.py: Validates our agent against ContextBench (1,136 human-annotated SWE-bench tasks). Downloads dataset from HuggingFace, converts agent output to trajectory format, evaluates file/symbol/span recall and precision.
- cross_validate_oracles.py: Compares agent-generated oracles against existing SG oracles (oracle_answer.json) and hand-curated ground truth (ground_truth.json). Reports file F1, Cohen's kappa, per-suite breakdown, and divergence analysis.

Addresses circularity concern: current oracles are 100% Sourcegraph-generated. Cross-validation with independent local tools empirically tests whether SG and non-SG methods agree. Also addresses SDLC coverage gap (69/150 tasks missing ground_truth.json).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
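The agreement metrics named in the commit message (file F1 and Cohen's kappa) can be illustrated with a minimal sketch. This is not the code from cross_validate_oracles.py. The function names `file_prf` and `cohens_kappa`, the example file lists, and the choice of a fixed candidate-file universe for kappa are all assumptions made for illustration.

```python
from typing import Iterable


def file_prf(predicted: Iterable[str], reference: Iterable[str]) -> tuple[float, float, float]:
    """File-level precision, recall, and F1 of `predicted` against `reference`."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    # With no overlap, precision and recall are both 0, so F1 is defined as 0.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def cohens_kappa(a: Iterable[str], b: Iterable[str], universe: Iterable[str]) -> float:
    """Cohen's kappa for two raters making include/exclude calls on each file.

    `universe` is the set of candidate files both raters could have selected
    (hypothetical here; the real script may define it differently).
    """
    set_a, set_b, items = set(a), set(b), set(universe)
    n = len(items)
    if n == 0:
        return 0.0
    both = len(set_a & set_b & items)
    only_a = len((set_a - set_b) & items)
    only_b = len((set_b - set_a) & items)
    neither = n - both - only_a - only_b
    p_observed = (both + neither) / n
    p_a = (both + only_a) / n  # rate at which rater A includes a file
    p_b = (both + only_b) / n  # rate at which rater B includes a file
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1.0:
        # Both raters made the same constant decision; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical oracles for one task: agent-generated vs. SG-generated.
agent_files = ["a.py", "b.py", "c.py"]
sg_files = ["b.py", "c.py", "d.py"]
candidates = ["a.py", "b.py", "c.py", "d.py", "e.py", "f.py"]

precision, recall, f1 = file_prf(agent_files, sg_files)
kappa = cohens_kappa(agent_files, sg_files, candidates)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} kappa={kappa:.3f}")
```

F1 only looks at files that at least one method selected, while kappa also credits agreement on files both methods left out and corrects for chance, which is why the two can diverge on tasks with large candidate sets.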
1 parent c5b29a9 commit c85ae1b

5 files changed, +2641 -2 lines changed

docs/ops/SCRIPT_INDEX.md

Lines changed: 3 additions & 0 deletions
@@ -140,6 +140,7 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 ## Validation
 
 - `scripts/validate_enterprise_readiness.py` - Validation script for validate enterprise readiness.
+- `scripts/validate_on_contextbench.py` - Validation script for validate on contextbench.
 
 ## Generation
 
@@ -167,7 +168,9 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/backfill_triage_from_manifest.py` [one_off] - Historical one-off script: backfill triage from manifest.
 - `scripts/check_harness_readiness.py` - Utility script for check harness readiness.
 - `scripts/compute_bootstrap_cis.py` - Utility script for compute bootstrap cis.
+- `scripts/context_retrieval_agent.py` - Utility script for context retrieval agent.
 - `scripts/control_plane.py` - Utility script for control plane.
+- `scripts/cross_validate_oracles.py` - Utility script for cross validate oracles.
 - `scripts/daytona_poc_runner.py` - Utility script for daytona poc runner.
 - `scripts/daytona_runner.py` - Utility script for daytona runner.
 - `scripts/dependeval_eval_dr.py` - Utility script for dependeval eval dr.
