
Commit dcb1bac

sjarmak and claude committed

Update IR metrics after openlibrary baseline promotion

- Run normalize_retrieval_events + retrieval_eval_pipeline on all official runs
- New baseline has 10 local retrieval events (8 file_read, 2 file_write); 1/11 ground truth files hit (file_recall=0.091)
- Aggregate IR: 618 computable tasks (was 594), 1084 event files (was 1005)
- Update technical report IR tables with exact pipeline outputs
- Fix slice distributions: file reads 376 (was 364), code search 125 (was 115), MCP retrieval 243/618 (was 229/594)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent d7ab5ec commit dcb1bac

File tree

5 files changed: +733 −136 lines


docs/technical_reports/TECHNICAL_REPORT_V1.md

Lines changed: 16 additions & 16 deletions
@@ -943,19 +943,19 @@ The strongest pattern in this paired slice is by `context_length`: the larger pr
 
 ### 11.6 Information Retrieval Metrics
 
-The IR evaluation pipeline (Section 8) produces file-level recall, MRR, MAP, nDCG, context efficiency, and utilization probes for tasks with ground truth file sets. Results from the full pipeline run (n=594 computable tasks out of 1,005 event files):
+The IR evaluation pipeline (Section 8) produces file-level recall, MRR, MAP, nDCG, context efficiency, and utilization probes for tasks with ground truth file sets. Results from the full pipeline run (n=618 computable tasks out of 1,084 event files):
 
 **Aggregate File-Level IR Metrics:**
 
 | Metric | Mean | Median | Std | n |
 |--------|------|--------|-----|---|
-| File Recall | 0.375 | 0.111 | 0.424 | 594 |
-| MRR | 0.347 | 0.007 | 0.443 | 594 |
-| MAP | 0.232 | 0.008 | 0.340 | 594 |
-| Context Efficiency | 0.190 | 0.013 | 0.280 | 594 |
-| Precision@1 | 0.298 | 0.000 | 0.458 | 594 |
-| Recall@5 | 0.223 | 0.000 | 0.345 | 594 |
-| nDCG@10 | 0.275 | 0.000 | 0.371 | 594 |
+| File Recall | 0.374 | 0.125 | 0.421 | 618 |
+| MRR | 0.351 | 0.009 | 0.444 | 618 |
+| MAP | 0.230 | 0.010 | 0.337 | 618 |
+| Context Efficiency | 0.192 | 0.019 | 0.279 | 618 |
+| Precision@1 | 0.301 | 0.000 | 0.459 | 618 |
+| Recall@5 | 0.220 | 0.000 | 0.342 | 618 |
+| nDCG@10 | 0.274 | 0.000 | 0.369 | 618 |
 
 **High-Confidence Subset** (medium/high-confidence ground truth, n=26):
@@ -967,25 +967,25 @@ The IR evaluation pipeline (Section 8) produces file-level recall, MRR, MAP, nDC
 | Context Efficiency | 0.432 | 0.287 |
 | TTFR | 24.9s | 11.1s |
 
-**Utilization Probes** (n=594):
+**Utilization Probes** (n=618):
 
 | Probe | Mean | Median |
 |-------|------|--------|
-| Read Overlap with Relevant Files | 0.337 | 0.093 |
-| Write Overlap with Relevant Files | 0.056 | 0.000 |
-| Read-Before-Write Ratio | 0.195 | 0.000 |
+| Read Overlap with Relevant Files | 0.335 | 0.111 |
+| Write Overlap with Relevant Files | 0.055 | 0.000 |
+| Read-Before-Write Ratio | 0.194 | 0.000 |
 
-**Error Taxonomy** (n=594):
+**Error Taxonomy** (n=618):
 
 | Error Type | Mean Count | Median |
 |------------|-----------|--------|
-| Irrelevant Retrieval | 39.7 | 7.0 |
+| Irrelevant Retrieval | 39.1 | 8.0 |
 | Missed Key Evidence | 5.8 | 3.0 |
 | Wrong Evidence Used | 2.2 | 1.0 |
 | Unused Correct Retrieval | 2.2 | 0.0 |
-| Ambiguity Near Miss | 17.2 | 0.0 |
+| Ambiguity Near Miss | 16.9 | 0.0 |
 
-**Retrieval-Outcome Correlation:** Spearman rho = 0.078 (p=0.737, n=26 high-confidence tasks), indicating negligible correlation between retrieval quality (MRR) and task outcome (reward) in the current sample. The wide median-mean gaps across all IR metrics reflect a bimodal distribution: agents either find the right files early (high MRR) or miss them entirely (MRR=0). The dominant retrieval strategy is file reads (364 tasks), followed by code search (115 tasks), with MCP-based retrieval accounting for 229 of 594 evidence traces.
+**Retrieval-Outcome Correlation:** Spearman rho = 0.078 (p=0.737, n=26 high-confidence tasks), indicating negligible correlation between retrieval quality (MRR) and task outcome (reward) in the current sample. The wide median-mean gaps across all IR metrics reflect a bimodal distribution: agents either find the right files early (high MRR) or miss them entirely (MRR=0). The dominant retrieval strategy is file reads (376 tasks), followed by code search (125 tasks), with MCP-based retrieval accounting for 243 of 618 evidence traces.
 
 ### 11.7 MCP Tool Usage Patterns
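The file-level metrics updated in the tables above can be illustrated with a minimal sketch. This is not the report's actual `retrieval_eval_pipeline`; the function names and the example data below are hypothetical, chosen to mirror the new baseline trace in this commit (11 ground-truth files, one hit on the third read, so file_recall = 1/11 ≈ 0.091).

```python
# Illustrative reimplementation of three of the IR metrics named above.
# Not the report's pipeline; function names are hypothetical.

def file_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of ground-truth files touched at least once."""
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant file in retrieval order."""
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the first k retrieved files that are relevant."""
    top = retrieved[:k]
    return sum(1 for path in top if path in relevant) / k if top else 0.0

# Hypothetical trace shaped like the new baseline: 11 ground-truth files,
# only bookshelves.py found, at rank 3.
relevant = {f"gt_file_{i}.py" for i in range(10)} | {"openlibrary/core/bookshelves.py"}
retrieved = [
    "tests/test_docker_compose.py",
    "compose.yaml",
    "openlibrary/core/bookshelves.py",
]
print(round(file_recall(retrieved, relevant), 3))  # 0.091
print(round(mrr(retrieved, relevant), 3))          # 0.333
```

A bimodal MRR distribution (median near 0, mean near 0.35) falls out of this definition directly: a task with any early hit scores close to 1, and a task with no hit scores exactly 0.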
Lines changed: 266 additions & 0 deletions
@@ -0,0 +1,266 @@
+{
+  "schema_version": "1.0",
+  "generated_at": "2026-02-27T17:41:30.859784+00:00",
+  "provenance": {
+    "run_id": "ccb_fix_haiku_20260227_151833",
+    "batch_timestamp": "2026-02-27__15-18-42",
+    "task_name": "openlibrary-solr-boolean-fix-001",
+    "trial_name": "openlibrary-solr-boolean-fix-001__QG7rmi9",
+    "config_name": "baseline-local-direct",
+    "benchmark": "unknown",
+    "model": "claude-haiku-4-5-20251001"
+  },
+  "coverage": {
+    "has_trajectory": true,
+    "has_transcript": true,
+    "has_ground_truth": true,
+    "has_chunk_ground_truth": false,
+    "trace_source": "merged",
+    "degraded_reason": null,
+    "ground_truth_source": "patch",
+    "ground_truth_confidence": "high"
+  },
+  "ground_truth": {
+    "files": [
+      "docker-compose.yml",
+      "openlibrary/core/bookshelves.py",
+      "openlibrary/core/models.py",
+      "openlibrary/core/ratings.py",
+      "openlibrary/macros/SearchResultsWork.html",
+      "openlibrary/macros/StarRatings.html",
+      "openlibrary/plugins/upstream/mybooks.py",
+      "openlibrary/plugins/worksearch/code.py",
+      "openlibrary/templates/account/books.html",
+      "openlibrary/templates/account/reading_log.html",
+      "openlibrary/utils/solr.py"
+    ],
+    "expected_edit_files": [
+      "docker-compose.yml",
+      "openlibrary/core/bookshelves.py",
+      "openlibrary/core/models.py",
+      "openlibrary/core/ratings.py",
+      "openlibrary/macros/SearchResultsWork.html",
+      "openlibrary/macros/StarRatings.html",
+      "openlibrary/plugins/upstream/mybooks.py",
+      "openlibrary/plugins/worksearch/code.py",
+      "openlibrary/templates/account/books.html",
+      "openlibrary/templates/account/reading_log.html",
+      "openlibrary/utils/solr.py"
+    ],
+    "expected_edit_files_source": "patch",
+    "expected_edit_files_confidence": "high"
+  },
+  "events": [
+    {
+      "step_index": 10,
+      "timestamp": "2026-02-27T15:38:49.757Z",
+      "tool_name": "Read",
+      "tool_category": "file_read",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/tests/test_docker_compose.py"
+      },
+      "target_files": [
+        "tests/test_docker_compose.py"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": false,
+      "matched_ground_truth_files": [],
+      "is_subagent": false,
+      "cumulative_tokens": 48191,
+      "elapsed_seconds": 74.6
+    },
+    {
+      "step_index": 12,
+      "timestamp": "2026-02-27T15:38:55.048Z",
+      "tool_name": "Read",
+      "tool_category": "file_read",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/compose.yaml"
+      },
+      "target_files": [
+        "compose.yaml"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": false,
+      "matched_ground_truth_files": [],
+      "is_subagent": false,
+      "cumulative_tokens": 51836,
+      "elapsed_seconds": 79.9
+    },
+    {
+      "step_index": 13,
+      "timestamp": "2026-02-27T15:38:58.085Z",
+      "tool_name": "Read",
+      "tool_category": "file_read",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/openlibrary/core/bookshelves.py"
+      },
+      "target_files": [
+        "openlibrary/core/bookshelves.py"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": true,
+      "matched_ground_truth_files": [
+        "openlibrary/core/bookshelves.py"
+      ],
+      "is_subagent": false,
+      "cumulative_tokens": 52551,
+      "elapsed_seconds": 82.9
+    },
+    {
+      "step_index": 22,
+      "timestamp": "2026-02-27T15:41:29.806Z",
+      "tool_name": "Read",
+      "tool_category": "file_read",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/pyproject.toml"
+      },
+      "target_files": [
+        "pyproject.toml"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": false,
+      "matched_ground_truth_files": [],
+      "is_subagent": false,
+      "cumulative_tokens": 87628,
+      "elapsed_seconds": 234.6
+    },
+    {
+      "step_index": 26,
+      "timestamp": "2026-02-27T15:41:51.029Z",
+      "tool_name": "Read",
+      "tool_category": "file_read",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/scripts/solr_builder/compose.yaml"
+      },
+      "target_files": [
+        "scripts/solr_builder/compose.yaml"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": false,
+      "matched_ground_truth_files": [],
+      "is_subagent": false,
+      "cumulative_tokens": 99767,
+      "elapsed_seconds": 255.9
+    },
+    {
+      "step_index": 28,
+      "timestamp": "2026-02-27T15:42:07.263Z",
+      "tool_name": "Read",
+      "tool_category": "file_read",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/scripts/solr_builder/compose.yaml"
+      },
+      "target_files": [
+        "scripts/solr_builder/compose.yaml"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": false,
+      "matched_ground_truth_files": [],
+      "is_subagent": false,
+      "cumulative_tokens": 107703,
+      "elapsed_seconds": 272.1
+    },
+    {
+      "step_index": 29,
+      "timestamp": "2026-02-27T15:42:10.637Z",
+      "tool_name": "Edit",
+      "tool_category": "file_write",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/scripts/solr_builder/compose.yaml",
+        "old_string": "      - SOLR_OPTS=\n        -Dsolr.autoCommit.maxTime=3600000\n        -Dsolr.environment=test",
+        "new_string": "      - SOLR_OPTS=\n        -Dsolr.autoCommit.maxTime=3600000\n        -Dsolr.max.booleanClauses=30000\n        -Dsolr.environment=test",
+        "replace_all": false
+      },
+      "target_files": [
+        "scripts/solr_builder/compose.yaml"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": false,
+      "matched_ground_truth_files": [],
+      "is_subagent": false,
+      "cumulative_tokens": 112164,
+      "elapsed_seconds": 275.5
+    },
+    {
+      "step_index": 36,
+      "timestamp": "2026-02-27T15:43:06.537Z",
+      "tool_name": "Read",
+      "tool_category": "file_read",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/setup.py"
+      },
+      "target_files": [
+        "setup.py"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": false,
+      "matched_ground_truth_files": [],
+      "is_subagent": false,
+      "cumulative_tokens": 120607,
+      "elapsed_seconds": 331.4
+    },
+    {
+      "step_index": 43,
+      "timestamp": "2026-02-27T15:43:55.978Z",
+      "tool_name": "Read",
+      "tool_category": "file_read",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/tests/test_docker_compose.py"
+      },
+      "target_files": [
+        "tests/test_docker_compose.py"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": false,
+      "matched_ground_truth_files": [],
+      "is_subagent": false,
+      "cumulative_tokens": 131329,
+      "elapsed_seconds": 380.8
+    },
+    {
+      "step_index": 44,
+      "timestamp": "2026-02-27T15:44:01.767Z",
+      "tool_name": "Edit",
+      "tool_category": "file_write",
+      "is_mcp": false,
+      "arguments": {
+        "file_path": "/workspace/tests/test_docker_compose.py",
+        "old_string": "    def test_shared_constants(self):\n        # read the value in compose.yaml\n        with open(p(\"..\", \"compose.yaml\")) as f:\n            prod_dc: dict = yaml.safe_load(f)\n        solr_service = prod_dc[\"services\"][\"solr\"]\n        solr_opts = next(var.split(\"=\", 1)[1] for var in solr_service[\"environment\"] if var.startswith(\"SOLR_OPTS=\"))\n        solr_opts_max_boolean_clauses = next(int(opt.split(\"=\", 1)[1]) for opt in solr_opts.split() if opt.startswith(\"-Dsolr.max.booleanClauses\"))\n\n        # read the value in openlibrary/core/bookshelves.py\n        from openlibrary.core.bookshelves import FILTER_BOOK_LIMIT\n\n        assert solr_opts_max_boolean_clauses >= FILTER_BOOK_LIMIT",
+        "new_string": "    def test_shared_constants(self):\n        import re\n\n        # read the value in compose.yaml\n        with open(p(\"..\", \"compose.yaml\")) as f:\n            prod_dc: dict = yaml.safe_load(f)\n        solr_service = prod_dc[\"services\"][\"solr\"]\n        solr_opts = next(var.split(\"=\", 1)[1] for var in solr_service[\"environment\"] if var.startswith(\"SOLR_OPTS=\"))\n        solr_opts_max_boolean_clauses = next(int(opt.split(\"=\", 1)[1]) for opt in solr_opts.split() if opt.startswith(\"-Dsolr.max.booleanClauses\"))\n\n        # read the value in openlibrary/core/bookshelves.py using regex to avoid Python 3.12+ syntax issues\n        with open(p(\"..\", \"openlibrary/core/bookshelves.py\")) as f:\n            content = f.read()\n        match = re.search(r'FILTER_BOOK_LIMIT:\\s*Final\\s*=\\s*(\\d+(?:_\\d+)*)', content)\n        assert match, \"FILTER_BOOK_LIMIT not found in bookshelves.py\"\n        filter_book_limit = int(match.group(1).replace('_', ''))\n\n        assert solr_opts_max_boolean_clauses >= filter_book_limit",
+        "replace_all": false
+      },
+      "target_files": [
+        "tests/test_docker_compose.py"
+      ],
+      "target_symbols": [],
+      "hits_ground_truth": false,
+      "matched_ground_truth_files": [],
+      "is_subagent": false,
+      "cumulative_tokens": 134314,
+      "elapsed_seconds": 386.6
+    }
+  ],
+  "summary": {
+    "total_events": 10,
+    "mcp_events": 0,
+    "local_events": 10,
+    "unique_files_accessed": 6,
+    "ground_truth_files_hit": 1,
+    "ground_truth_files_total": 11,
+    "first_ground_truth_hit_step": 13,
+    "events_by_category": {
+      "file_read": 8,
+      "file_write": 2
+    }
+  }
+}
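The `summary` block in the events file above is derivable from its `events` and `ground_truth` sections. A minimal sketch of that derivation follows; the `summarize` function is hypothetical and is not part of `normalize_retrieval_events`, but it reproduces the fields shown (e.g. 1 of 11 ground-truth files hit, first hit at step 13 for this trace).

```python
from collections import Counter

def summarize(doc: dict) -> dict:
    """Recompute the summary block of a retrieval-events document
    (hypothetical helper, shaped after the JSON file above)."""
    events = doc["events"]
    gt_files = set(doc["ground_truth"]["files"])

    hit_files: set[str] = set()
    first_hit_step = None
    for ev in events:
        if ev["hits_ground_truth"]:
            hit_files.update(ev["matched_ground_truth_files"])
            if first_hit_step is None:
                first_hit_step = ev["step_index"]

    return {
        "total_events": len(events),
        "mcp_events": sum(1 for ev in events if ev["is_mcp"]),
        "local_events": sum(1 for ev in events if not ev["is_mcp"]),
        "unique_files_accessed": len({f for ev in events for f in ev["target_files"]}),
        "ground_truth_files_hit": len(hit_files),
        "ground_truth_files_total": len(gt_files),
        "first_ground_truth_hit_step": first_hit_step,
        "events_by_category": dict(Counter(ev["tool_category"] for ev in events)),
    }
```

Note that `ground_truth_files_hit` counts distinct matched files, not hit events, which is why eight reads of mostly non-ground-truth files still yield a count of 1 here.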
