
Commit b92d51c (1 parent: bce4351)

docs: add v1 blog draft and retrieval breakdown update

2 files changed (+227, -0)

docs/codescalebench_blog_v1.md

Lines changed: 173 additions & 0 deletions

@@ -0,0 +1,173 @@
# CodeScaleBench: Testing Coding Agents on Large Codebases and Multi-Repo Software Engineering Tasks

_Alternate title: "Existing benchmarks are weak for evaluating enterprise-scale coding agents, so I built my own."_

In January I wrote about my frustrations with coding-agent benchmarks and why most of them do not answer the practical questions I care about. CodeScaleBench is the result: a benchmark designed to test coding agents on large codebases, multi-repo workflows, and tasks across the full software development lifecycle (SDLC), not just bug-fix micro-slices.
## Why I Built This

Most benchmark suites are strong in one narrow direction and weak in the rest:

- small or single-repo scope
- mostly one language family (often Python-heavy)
- weak or gameable verification
- poor auditability (limited or no transcript-level inspection)
- leaderboard-friendly summaries that hide important failure modes

What I wanted:

1. Large codebases (ideally 1M+ LOC, including very large repos).
2. Multi-language coverage.
3. Multi-repo tasks.
4. SDLC coverage: understand, design, feature, fix, test, docs, refactor, secure, debug.
5. Retrieval-aware evaluation (did the agent find the right context, and did that help?).
## What CodeScaleBench Is

CodeScaleBench currently comprises:

- **370 paired tasks total**
- **CodeScaleBench-SDLC**: 150 tasks across SDLC phases (direct code/task verifiers)
- **CodeScaleBench-Org**: 220 org-scale discovery tasks (artifact verifier on `answer.json`)
- **9 languages** across **40+ repositories**

Each task runs under two conditions:

- **Baseline**: local code plus standard local tools
- **MCP-augmented**: no local source; Sourcegraph MCP tools required

This setup is intentionally conservative for MCP: the baseline has complete local access, while the MCP condition must retrieve all context remotely.
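As a rough sketch of what an `answer.json` artifact verifier can look like: the field names and the exact-match partial-credit rule below are my illustration, not the benchmark's actual scoring logic.

```python
def score_answer(answer: dict, oracle: dict) -> float:
    """Score an Org-style discovery task by comparing a parsed answer.json
    against an oracle answer. Partial credit is the fraction of oracle
    fields matched exactly (case- and whitespace-insensitive).
    Field names and the matching rule are illustrative."""
    if not oracle:
        return 0.0
    matched = sum(
        1 for key, expected in oracle.items()
        if str(answer.get(key, "")).strip().lower() == str(expected).strip().lower()
    )
    return matched / len(oracle)
```

In practice an artifact verifier also needs schema validation and per-field normalization rules; this sketch only shows the comparison core.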
## Setup Summary

I evaluate the same task under baseline and MCP conditions to isolate retrieval/access-method effects.

For MCP runs:

- repositories are mirrored at pinned commits to ensure exact-version retrieval
- the agent gets Sourcegraph MCP tools (keyword search, semantic search, symbol navigation, dependency tracing, file reads, etc.)
## What I Adapted vs. What I Dropped

| Benchmark | Status | Notes |
|---|---|---|
| SWE-Bench Pro | Adapted | Useful issue-resolution tasks across languages. |
| LinuxFLBench | Adapted | Large-codebase fault-localization stress tests. |
| Qodo Code Review | Adapted | Used with synthetic defect injection. |
| TheAgentCompany | Adapted | One task retained (`bustub-hyperloglog-impl-001`). |
| RepoQA | Concepts reused | Ceiling saturation; replaced by harder large-repo tasks. |
| ContextBench | Used for curation | Used to calibrate curator-agent GT automation. |
| DIBench / DependEval / LoCoBench | Dropped | Not suitable for repo-grounded MCP evaluation in this framework. |

Most SDLC tasks and all Org tasks are original, pinned to real repository states.
## Headline Outcome (Current Analysis Snapshot)

From the current analysis set:

- **Overall reward delta (MCP - baseline): +0.0349**
- **SDLC delta: +0.0363**
- **Org delta: +0.0339**

Single-number summaries are directionally useful but not sufficient: the value is strongly task-type dependent.
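Because every task runs under both conditions, the natural statistic is a paired delta. A minimal sketch of how such a paired mean with a bootstrap confidence interval might be computed (the function name and data shape are mine, not the benchmark harness's):

```python
import random

def paired_delta_ci(baseline, mcp, n_boot=2000, seed=0):
    """Mean paired reward delta (MCP - baseline) with a percentile
    bootstrap 95% CI. Inputs are per-task reward lists aligned by task;
    this is an illustrative sketch, not the benchmark's analysis code."""
    deltas = [m - b for b, m in zip(baseline, mcp)]
    mean = sum(deltas) / len(deltas)
    rng = random.Random(seed)
    boots = sorted(
        sum(rng.choices(deltas, k=len(deltas))) / len(deltas)
        for _ in range(n_boot)
    )
    lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
    return mean, (lo, hi)
```

Reporting the interval alongside the +0.0349 point estimate is what makes a small headline delta interpretable.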
## Where MCP Shows the Most Value

Largest suite-level gains are concentrated in retrieval-heavy work:

- SDLC: strongest gains in **Understand**, **Refactor**, **Fix**
- Org: strongest gains in **Incident** and **Security**; these are often cross-repo and high-context tasks

This aligns with the expected value proposition: MCP helps most when relevant context is distributed and non-local.
## Updated Retrieval Breakdown (Newly Curated Ground Truth, `runs/analysis`)

I recomputed retrieval metrics for overlap tasks that have both pre-existing and curated ground-truth variants.
Source artifact: `results/ir/baseline_vs_mcp_breakdown_org_sdlc_runs_analysis_20260304.json`.

Scored tasks in this slice:

- Org: 206
- SDLC: 123
- Combined: 329
### Curated GT (`ground_truth_agent.json` / `oracle_answer_agent.json`)

| Group | n | P@5 (BL/MCP) | R@5 (BL/MCP) | F1@5 (BL/MCP) | P@10 (BL/MCP) | R@10 (BL/MCP) | F1@10 (BL/MCP) | Total File Recall (BL/MCP) |
|---|---:|---|---|---|---|---|---|---|
| Org | 206 | 0.000 / 0.365 | 0.000 / 0.262 | 0.000 / 0.275 | 0.001 / 0.245 | 0.001 / 0.314 | 0.001 / 0.246 | 0.001 / 0.322 |
| SDLC | 123 | 0.361 / 0.455 | 0.272 / 0.373 | 0.268 / 0.350 | 0.242 / 0.293 | 0.327 / 0.431 | 0.239 / 0.297 | 0.345 / 0.438 |
| Combined | 329 | 0.135 / 0.399 | 0.102 / 0.304 | 0.100 / 0.303 | 0.091 / 0.263 | 0.123 / 0.358 | 0.090 / 0.265 | 0.129 / 0.365 |
### Pre-existing GT (`ground_truth.json` / `oracle_answer.json`)

| Group | n | P@5 (BL/MCP) | R@5 (BL/MCP) | F1@5 (BL/MCP) | P@10 (BL/MCP) | R@10 (BL/MCP) | F1@10 (BL/MCP) | Total File Recall (BL/MCP) |
|---|---:|---|---|---|---|---|---|---|
| Org | 206 | 0.000 / 0.122 | 0.000 / 0.121 | 0.000 / 0.113 | 0.000 / 0.074 | 0.000 / 0.137 | 0.000 / 0.090 | 0.000 / 0.139 |
| SDLC | 123 | 0.296 / 0.379 | 0.288 / 0.405 | 0.262 / 0.347 | 0.192 / 0.231 | 0.335 / 0.458 | 0.216 / 0.274 | 0.347 / 0.471 |
| Combined | 329 | 0.111 / 0.218 | 0.108 / 0.227 | 0.098 / 0.200 | 0.072 / 0.133 | 0.125 / 0.257 | 0.081 / 0.159 | 0.130 / 0.263 |
## MCP Value Highlights from the New Retrieval Slices

### 1) Multi-repo tasks benefit more than single-repo tasks

Curated GT deltas (`MCP - baseline`, combined):

- `single_repo` (n=159): **F1@10 +0.1075**, **Total Recall +0.1658**
- `multi_repo` (n=170): **F1@10 +0.2387**, **Total Recall +0.3017**
### 2) Gains persist across size bins, with the strongest lift in the 1M-5M proxy bucket

Curated GT deltas (`MCP - baseline`):

- `<1M`: F1@10 +0.1047, Total +0.1736
- `1M-5M`: F1@10 +0.3417, Total +0.4148
- `5M-20M`: F1@10 +0.0696, Total +0.0960
- `>20M`: F1@10 +0.1653, Total +0.2104

Interpretation: retrieval lift is not uniform, but MCP shows clear upside where task context is more distributed and retrieval-heavy.
## Cost and Speed

Current paired means:

- mean cost delta: **+$0.040/task**
- wall-clock delta: **-36.22s**
- agent execution delta: **-101.06s**

So the current tradeoff is slightly higher spend for materially faster completion.
## Tool-Use Pattern

Agents heavily favor keyword search and file reads; Deep Search is rarely invoked organically.

This suggests prompt/tool-policy design still matters: more capability exists than default agent behavior actually exploits.
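Claims like this come from tallying tool calls across run trajectories. A small sketch of that tally; the step shape (dicts with an optional `"tool"` key) is an assumed schema, not the harness's actual trajectory format:

```python
from collections import Counter

def tool_call_histogram(steps) -> Counter:
    """Count tool invocations across a run's trajectory steps.
    Steps without a "tool" key (e.g. reasoning-only steps) are skipped.
    The step schema here is illustrative."""
    return Counter(step["tool"] for step in steps if "tool" in step)
```

Summing these histograms over all runs, then comparing counts for keyword search vs. semantic/deep search, yields the usage pattern described above.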
## Auditing Matters

Every run emits:

- `result.json` (score, timing, metadata)
- a full trajectory/transcript with tool calls

These traces are essential. They exposed benchmark bugs, prompt contamination, verifier issues, and environment loopholes (including a git-history bypass incident) that would have silently distorted results had they not been audited.
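Auditing starts with a completeness pass over every emitted `result.json`. A minimal sketch; the required-field set below is an assumption based on the fields listed above, not the benchmark's actual schema:

```python
# Illustrative required fields, inferred from the fields named in this post.
REQUIRED_FIELDS = ("score", "timing", "metadata")

def audit_result(result: dict) -> list:
    """Return the list of required fields missing from a parsed
    result.json; a complete run yields an empty list."""
    return [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in result]
```

Running this across all runs before analysis catches truncated or malformed results early, before they skew aggregate numbers.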
## Quality Assurance Is Most of the Work

Benchmark quality gates check:

1. Task validity
2. Outcome validity
3. Reporting completeness
4. Reproducibility
5. Tool effectiveness
6. Statistical validity

Without these gates, benchmark claims become fragile very quickly.
## What This Means

The current signal is not “MCP always wins.” The signal is:

- MCP has measurable value, especially in cross-repo and context-heavy discovery tasks.
- The effect is heterogeneous across task families.
- Retrieval-quality improvements do not always map linearly to reward outcomes.

That is exactly why this benchmark is structured around SDLC and org-use-case slices instead of a single aggregate score.
## What’s Next

Planned next steps:

1. Expand multi-run coverage to reduce non-determinism noise.
2. Evaluate additional harnesses (Codex, Cursor, Gemini, Copilot, OpenHands).
3. Compare alternate MCP providers on the same task set.
4. Run tool-policy experiments (especially semantic/deep-search nudges).
5. Continue tightening verifier and QA infrastructure before final white-paper publication.

docs/technical_reports/TECHNICAL_REPORT_V2.md

Lines changed: 54 additions & 0 deletions
@@ -949,6 +949,60 @@ This indicates retrieval quality remains moderate on computable tasks, but groun
MCP runs show higher recall and slightly higher ranking/efficiency metrics on computable retrieval tasks.
### 11.5.1 Retrieval Breakdown on Newly Curated Ground Truth (runs/analysis)

To isolate retrieval-quality effects on the currently curated task set, we recomputed baseline-vs-MCP file-level metrics directly from `runs/analysis/**/agent/trajectory.json` and compared:

- pre-existing ground truth (`ground_truth.json` / `oracle_answer.json`), and
- curated ground truth (`ground_truth_agent.json` / `oracle_answer_agent.json`).

Output artifact: `results/ir/baseline_vs_mcp_breakdown_org_sdlc_runs_analysis_20260304.json`.
Coverage in this slice:

- Scored task pairs: **329** (`org=206`, `sdlc=123`)
- Metrics shown: Precision@5, Recall@5, F1@5, Precision@10, Recall@10, F1@10, and full-set `total_file_recall`
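The full-set `total_file_recall` metric, unlike the rank-cutoff metrics, considers every file the agent touched anywhere in the run. A sketch of one plausible definition (names are illustrative):

```python
def total_file_recall(all_retrieved, relevant) -> float:
    """Fraction of ground-truth files the agent touched anywhere in
    the run, with no rank cutoff. A sketch of the metric's definition,
    not the report's exact implementation."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(all_retrieved)) / len(relevant)
```

This is why `total_file_recall` can exceed R@10 in the tables below: late-run file reads still count toward it.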
#### Curated Ground Truth (preferred)

| Group | n | P@5 (BL / MCP) | R@5 (BL / MCP) | F1@5 (BL / MCP) | P@10 (BL / MCP) | R@10 (BL / MCP) | F1@10 (BL / MCP) | Total File Recall (BL / MCP) |
|-------|---:|----------------|----------------|-----------------|-----------------|-----------------|------------------|-------------------------------|
| Org | 206 | 0.000 / 0.365 | 0.000 / 0.262 | 0.000 / 0.275 | 0.001 / 0.245 | 0.001 / 0.314 | 0.001 / 0.246 | 0.001 / 0.322 |
| SDLC | 123 | 0.361 / 0.455 | 0.272 / 0.373 | 0.268 / 0.350 | 0.242 / 0.293 | 0.327 / 0.431 | 0.239 / 0.297 | 0.345 / 0.438 |
| Combined | 329 | 0.135 / 0.399 | 0.102 / 0.304 | 0.100 / 0.303 | 0.091 / 0.263 | 0.123 / 0.358 | 0.090 / 0.265 | 0.129 / 0.365 |
Key interpretation:

- MCP improves retrieval substantially on both benchmark families in the curated set.
- The largest absolute lift appears on Org tasks, where baseline retrieval against the curated oracle is near zero while MCP reaches meaningful coverage.
#### Pre-existing Ground Truth (for continuity)

| Group | n | P@5 (BL / MCP) | R@5 (BL / MCP) | F1@5 (BL / MCP) | P@10 (BL / MCP) | R@10 (BL / MCP) | F1@10 (BL / MCP) | Total File Recall (BL / MCP) |
|-------|---:|----------------|----------------|-----------------|-----------------|-----------------|------------------|-------------------------------|
| Org | 206 | 0.000 / 0.122 | 0.000 / 0.121 | 0.000 / 0.113 | 0.000 / 0.074 | 0.000 / 0.137 | 0.000 / 0.090 | 0.000 / 0.139 |
| SDLC | 123 | 0.296 / 0.379 | 0.288 / 0.405 | 0.262 / 0.347 | 0.192 / 0.231 | 0.335 / 0.458 | 0.216 / 0.274 | 0.347 / 0.471 |
| Combined | 329 | 0.111 / 0.218 | 0.108 / 0.227 | 0.098 / 0.200 | 0.072 / 0.133 | 0.125 / 0.257 | 0.081 / 0.159 | 0.130 / 0.263 |
#### Correlation Slices: Multi-Repo and Size Effects

On curated ground truth (`MCP - Baseline`, combined):

| Slice | n | Δ F1@10 | Δ Total File Recall |
|-------|---:|--------:|--------------------:|
| single_repo | 159 | +0.1075 | +0.1658 |
| multi_repo | 170 | +0.2387 | +0.3017 |
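Per-slice deltas like these reduce to grouping paired tasks by a metadata label and averaging the per-task delta. A compact sketch (the record shape with `slice`, `baseline`, and `mcp` keys is illustrative):

```python
from collections import defaultdict

def slice_deltas(tasks):
    """Mean per-slice metric delta (MCP - baseline). Each task record
    carries a slice label (e.g. single_repo / multi_repo) and aligned
    baseline/MCP scores; the record shape is an assumption."""
    sums, counts = defaultdict(float), defaultdict(int)
    for t in tasks:
        sums[t["slice"]] += t["mcp"] - t["baseline"]
        counts[t["slice"]] += 1
    return {s: sums[s] / counts[s] for s in sums}
```

The same helper applies unchanged to the size-bin slicing below, with the bin label as the slice key.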
Curated size-bin deltas (`MCP - Baseline`):

| Size Bin (proxy) | n | Δ F1@10 | Δ Total File Recall |
|------------------|---:|--------:|--------------------:|
| <1M | 144 | +0.1047 | +0.1736 |
| 1M-5M | 99 | +0.3417 | +0.4148 |
| 5M-20M | 57 | +0.0696 | +0.0960 |
| >20M | 29 | +0.1653 | +0.2104 |
These slices indicate that MCP retrieval gains are larger on multi-repo tasks than on single-repo tasks in this snapshot.

Methodology note: size bins here are metadata-driven proxies (`repo_set` fixture LOC totals for Org; `context_length` proxy for SDLC with fallback), so they should be interpreted as directional rather than as exact physical repository-size measurements.
### 11.6 Correlation Analysis

Correlation was recomputed from the retrieval analysis output (`docs/analysis/ir_analysis_analysis_set_20260303.json`):
