Commit c85ae1b
feat: oracle cross-validation agent with multi-backend retrieval

Add three scripts for cross-validating oracle ground truth:

- context_retrieval_agent.py: Claude-powered retrieval agent with 3
  backends (local/deepsearch/hybrid). Uses a configurable model
  (default: Sonnet 4.6). Local tools: grep, rg, find, file reading,
  import search, symbol search. Sourcegraph (SG) tools: deep search,
  keyword search. Cross-validate mode runs all backends and reports
  agreement metrics.
- validate_on_contextbench.py: Validates the agent against ContextBench
  (1,136 human-annotated SWE-bench tasks). Downloads the dataset from
  HuggingFace, converts agent output to trajectory format, and
  evaluates file/symbol/span recall and precision.
- cross_validate_oracles.py: Compares agent-generated oracles against
  existing SG oracles (oracle_answer.json) and hand-curated ground
  truth (ground_truth.json). Reports file F1, Cohen's kappa, per-suite
  breakdown, and divergence analysis.
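
For reference, the file-level metrics named above can be sketched as follows. This is a minimal illustration, not the code in cross_validate_oracles.py: the function names, the use of plain sets of file paths, and the choice of a fixed candidate-file universe for Cohen's kappa are all assumptions made for the example.

```python
from typing import Dict, Set


def file_prf(predicted: Set[str], reference: Set[str]) -> Dict[str, float]:
    """File-level precision, recall, and F1 between two sets of paths.

    Hypothetical helper: `predicted` would be the agent's files and
    `reference` the oracle's files for one task.
    """
    overlap = len(predicted & reference)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def cohens_kappa(a: Set[str], b: Set[str], universe: Set[str]) -> float:
    """Cohen's kappa for two binary raters over the files in `universe`.

    Each rater labels a file "relevant" if it is in that rater's set.
    The universe of candidate files is an assumption of this sketch.
    """
    n = len(universe)
    if n == 0:
        return 0.0
    a, b = a & universe, b & universe
    both = len(a & b)               # files both raters selected
    neither = n - len(a | b)        # files neither rater selected
    p_observed = (both + neither) / n
    p_a, p_b = len(a) / n, len(b) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1.0:
        return 1.0  # both raters gave the same constant label everywhere
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    universe = {f"f{i}.py" for i in range(10)}
    agent = {"f0.py", "f1.py", "f2.py"}
    sg_oracle = {"f0.py", "f1.py", "f3.py"}
    print(file_prf(agent, sg_oracle))
    print(cohens_kappa(agent, sg_oracle, universe))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because most files in a repository are irrelevant to any given task and two methods would "agree" on them trivially.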

Addresses the circularity concern: current oracles are 100%
Sourcegraph-generated. Cross-validation with independent local tools
empirically tests whether SG and non-SG methods agree. Also addresses
the SDLC coverage gap (69/150 tasks missing ground_truth.json).
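
A coverage gap like the one above could be measured with a scan along these lines. This is only a sketch: the one-directory-per-task layout and the function name are assumptions, and only the ground_truth.json filename comes from this commit.

```python
from pathlib import Path
from typing import List


def tasks_missing_ground_truth(tasks_root: Path) -> List[str]:
    """Return names of task directories that lack ground_truth.json.

    Assumes (hypothetically) that each task lives in its own directory
    directly under `tasks_root`.
    """
    return sorted(
        d.name
        for d in tasks_root.iterdir()
        if d.is_dir() and not (d / "ground_truth.json").is_file()
    )


if __name__ == "__main__":
    import sys

    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    missing = tasks_missing_ground_truth(root)
    total = sum(1 for d in root.iterdir() if d.is_dir())
    print(f"{len(missing)}/{total} tasks missing ground_truth.json")
```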

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent c5b29a9 commit c85ae1b
File tree
5 files changed (+2641, -2 lines)
- docs/ops
- scripts