# CodeScaleBench: Testing Coding Agents on Large Codebases and Multi-Repo Software Engineering Tasks

_Alternate title: "Existing benchmarks are weak for evaluating enterprise-scale coding agents, so I built my own."_

In January I wrote about my frustrations with coding-agent benchmarks and why most of them do not answer the practical questions I care about. CodeScaleBench is the result: a benchmark designed to test coding agents on large codebases, multi-repo workflows, and tasks across the full software development lifecycle (SDLC), not just bug-fix micro-slices.

## Why I Built This

Most benchmark suites are strong in one narrow direction and weak in the rest:
- small or single-repo scope
- mostly one language family (often Python-heavy)
- weak or gameable verification
- poor auditability (limited or no transcript-level inspection)
- leaderboard-friendly summaries that hide important failure modes

What I wanted:
1. Large codebases (ideally 1M+ LOC, including very large repos).
2. Multi-language coverage.
3. Multi-repo tasks.
4. SDLC coverage: understand, design, feature, fix, test, docs, refactor, secure, debug.
5. Retrieval-aware evaluation (did the agent find the right context, and did that help?).

## What CodeScaleBench Is

CodeScaleBench is currently:
- **370 paired tasks total**
- **CodeScaleBench-SDLC**: 150 tasks across SDLC phases (direct code/task verifiers)
- **CodeScaleBench-Org**: 220 org-scale discovery tasks (artifact verifier on `answer.json`)
- **9 languages** across **40+ repositories**
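
The Org artifact verifier can be sketched roughly as follows. This is a minimal illustration assuming a flat JSON answer format; the actual verifier, field names, and scoring rules in CodeScaleBench may well be richer:

```python
import json

def score_answer(answer_path: str, oracle_path: str) -> float:
    """Fraction of oracle fields the agent answered correctly.

    Assumes both files are flat JSON objects; the real verifier's
    schema and scoring logic may differ from this sketch.
    """
    with open(answer_path) as f:
        answer = json.load(f)
    with open(oracle_path) as f:
        oracle = json.load(f)
    if not oracle:
        return 0.0
    correct = sum(1 for key, expected in oracle.items()
                  if answer.get(key) == expected)
    return correct / len(oracle)
```

The upside of verifying an artifact rather than a diff is that discovery tasks stay checkable even when no code change is expected.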

Two run conditions per task:
- **Baseline**: local code + standard local tools
- **MCP-augmented**: no local source, Sourcegraph MCP tools required

This is intentionally conservative for MCP: baseline has complete local access, while MCP must retrieve context remotely.

## Setup Summary

I evaluate the same task under baseline vs MCP to isolate retrieval/access-method effects.

For MCP runs:
- repositories are mirrored at pinned commits to ensure exact-version retrieval
- the agent gets Sourcegraph MCP tools (keyword search, semantic search, symbol navigation, dependency tracing, file reads, etc.)
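
Exact-version retrieval only works if every pin is an immutable commit rather than a branch name. A small validation sketch; the task layout here is hypothetical, not the actual CodeScaleBench schema:

```python
import re

# A pin must be a full 40-character hex SHA; branch names and short
# SHAs are rejected so runs stay exactly reproducible.
PINNED_SHA = re.compile(r"^[0-9a-f]{40}$")

def unpinned_repos(task: dict) -> list:
    """Return names of repos in a task spec not pinned to an exact commit.

    Assumed illustrative layout: {"repos": [{"name": ..., "commit": ...}]}.
    """
    return [r["name"] for r in task.get("repos", [])
            if not PINNED_SHA.match(r.get("commit", ""))]
```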

## What I Adapted vs. What I Dropped

| Benchmark | Status | Notes |
|---|---|---|
| SWE-Bench Pro | Adapted | Useful issue-resolution tasks across languages. |
| LinuxFLBench | Adapted | Large-codebase fault-localization stress tests. |
| Qodo Code Review | Adapted | Used with synthetic defect injection. |
| TheAgentCompany | Adapted | One task retained (`bustub-hyperloglog-impl-001`). |
| RepoQA | Concepts reused | Ceiling saturation; replaced by harder large-repo tasks. |
| ContextBench | Used for curation | Used to calibrate curator-agent GT automation. |
| DIBench / DependEval / LoCoBench | Dropped | Not suitable for repo-grounded MCP evaluation in this framework. |

Most SDLC tasks and all Org tasks are original, pinned to real repository states.

## Headline Outcome (Current Analysis Snapshot)

From the current analysis set:
- **Overall reward delta (MCP - baseline): +0.0349**
- **SDLC delta: +0.0363**
- **Org delta: +0.0339**

Single-number summaries are directionally useful but not sufficient: the value of MCP is task-type dependent.
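
For concreteness, the paired delta is just the mean per-task difference over tasks that completed under both conditions. A sketch, with a hypothetical results layout:

```python
def mean_reward_delta(results: dict) -> float:
    """Mean (MCP - baseline) reward over tasks with both runs.

    `results` maps task_id -> {"baseline": float, "mcp": float};
    this layout is illustrative, not the benchmark's actual format.
    """
    deltas = [r["mcp"] - r["baseline"]
              for r in results.values()
              if "mcp" in r and "baseline" in r]
    return sum(deltas) / len(deltas) if deltas else 0.0
```

Pairing on task identity is what lets a small aggregate delta remain meaningful despite large per-task variance.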

## Where MCP Shows the Most Value

Largest suite-level gains are concentrated in retrieval-heavy work:
- SDLC: strongest gains in **Understand**, **Refactor**, **Fix**
- Org: strongest gains in **Incident** and **Security**; these are often cross-repo and high-context tasks

This aligns with the expected value proposition: MCP helps most when relevant context is distributed and non-local.

## Updated Retrieval Breakdown (Newly Curated Ground Truth, `runs/analysis`)

I recomputed retrieval metrics for overlap tasks with both pre-existing and curated ground truth variants.
Source artifact: `results/ir/baseline_vs_mcp_breakdown_org_sdlc_runs_analysis_20260304.json`.

Scored tasks in this slice:
- Org: 206
- SDLC: 123
- Combined: 329

### Curated GT (`ground_truth_agent.json` / `oracle_answer_agent.json`)

| Group | n | P@5 (BL/MCP) | R@5 (BL/MCP) | F1@5 (BL/MCP) | P@10 (BL/MCP) | R@10 (BL/MCP) | F1@10 (BL/MCP) | Total File Recall (BL/MCP) |
|---|---:|---|---|---|---|---|---|---|
| Org | 206 | 0.000 / 0.365 | 0.000 / 0.262 | 0.000 / 0.275 | 0.001 / 0.245 | 0.001 / 0.314 | 0.001 / 0.246 | 0.001 / 0.322 |
| SDLC | 123 | 0.361 / 0.455 | 0.272 / 0.373 | 0.268 / 0.350 | 0.242 / 0.293 | 0.327 / 0.431 | 0.239 / 0.297 | 0.345 / 0.438 |
| Combined | 329 | 0.135 / 0.399 | 0.102 / 0.304 | 0.100 / 0.303 | 0.091 / 0.263 | 0.123 / 0.358 | 0.090 / 0.265 | 0.129 / 0.365 |

### Pre-existing GT (`ground_truth.json` / `oracle_answer.json`)

| Group | n | P@5 (BL/MCP) | R@5 (BL/MCP) | F1@5 (BL/MCP) | P@10 (BL/MCP) | R@10 (BL/MCP) | F1@10 (BL/MCP) | Total File Recall (BL/MCP) |
|---|---:|---|---|---|---|---|---|---|
| Org | 206 | 0.000 / 0.122 | 0.000 / 0.121 | 0.000 / 0.113 | 0.000 / 0.074 | 0.000 / 0.137 | 0.000 / 0.090 | 0.000 / 0.139 |
| SDLC | 123 | 0.296 / 0.379 | 0.288 / 0.405 | 0.262 / 0.347 | 0.192 / 0.231 | 0.335 / 0.458 | 0.216 / 0.274 | 0.347 / 0.471 |
| Combined | 329 | 0.111 / 0.218 | 0.108 / 0.227 | 0.098 / 0.200 | 0.072 / 0.133 | 0.125 / 0.257 | 0.081 / 0.159 | 0.130 / 0.263 |
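
The per-task metrics behind these tables follow standard IR definitions. A minimal sketch, assuming P@k divides by k (conventions differ; some implementations divide by min(k, number retrieved)):

```python
def prf_at_k(retrieved: list, relevant: set, k: int):
    """Precision, recall, and F1 at rank k for a single task.

    `retrieved` is the agent's ranked list of files; `relevant` is
    the ground-truth file set. P@k here divides by k.
    """
    hits = len(set(retrieved[:k]) & relevant)
    p = hits / k
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```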

## MCP Value Highlights from the New Retrieval Slices

### 1) Multi-repo tasks benefit more than single-repo tasks

Curated GT deltas (`MCP - baseline`, combined):
- `single_repo` (n=159): **F1@10 +0.1075**, **Total Recall +0.1658**
- `multi_repo` (n=170): **F1@10 +0.2387**, **Total Recall +0.3017**
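
These slice deltas come from grouping per-task paired differences. A sketch with hypothetical per-task row fields:

```python
def delta_by_group(rows: list, group_key: str, metric: str) -> dict:
    """Mean (MCP - baseline) for one metric, split by a grouping key.

    Each row is a per-task dict; field names like "baseline_f1_10"
    and "mcp_f1_10" are illustrative, not the actual artifact schema.
    """
    groups = {}
    for row in rows:
        delta = row[f"mcp_{metric}"] - row[f"baseline_{metric}"]
        groups.setdefault(row[group_key], []).append(delta)
    return {g: sum(d) / len(d) for g, d in groups.items()}
```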

### 2) Gains persist across size bins, with the strongest lift in the 1M-5M proxy bucket

Curated GT deltas (`MCP - baseline`):
- `<1M`: F1@10 +0.1047, Total +0.1736
- `1M-5M`: F1@10 +0.3417, Total +0.4148
- `5M-20M`: F1@10 +0.0696, Total +0.0960
- `>20M`: F1@10 +0.1653, Total +0.2104

Interpretation: retrieval lift is not uniform, but MCP shows clear upside where task context is more distributed and retrieval-heavy.

## Cost and Speed

Current paired means:
- mean cost delta: **+$0.040/task**
- wall-clock delta: **-36.22s**
- agent execution delta: **-101.06s**

So the current tradeoff is slightly higher spend for materially faster completion.

## Tool-Use Pattern
| 130 | + |
| 131 | +Agents heavily favor keyword search and file reads. Deep Search remains rarely used organically. |
| 132 | + |
| 133 | +This suggests prompt/tool-policy design still matters: better capability exists than what default behavior frequently exploits. |

## Auditing Matters

Every run emits:
- `result.json` (score, timing, metadata)
- full trajectory/transcript with tool calls

These traces are essential. They exposed benchmark bugs, prompt contamination, verifier issues, and environment loopholes (including a git-history bypass incident) that would have silently distorted results if not audited.
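
One audit that pays for itself is scanning transcripts for tool calls that can leak answers, such as reading a fix out of git history. A sketch, assuming a JSONL transcript with one tool-call object per line (the real schema varies by harness):

```python
import json

# Patterns that have surfaced loopholes in past audits; illustrative only.
SUSPICIOUS = ("git log", "git show", "git reflog")

def flag_suspicious_calls(trajectory_path: str) -> list:
    """Return shell commands in a transcript matching a suspicious pattern.

    Assumes each line is a JSON object like
    {"tool": "shell", "args": {"command": "..."}}.
    """
    flagged = []
    with open(trajectory_path) as f:
        for line in f:
            event = json.loads(line)
            cmd = event.get("args", {}).get("command", "")
            if any(pat in cmd for pat in SUSPICIOUS):
                flagged.append(cmd)
    return flagged
```

Flagged calls still need human review; `git log` is legitimate in many tasks, so this is a triage filter, not a verdict.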

## Quality Assurance Is Most of the Work

Benchmark quality gates check:
1. Task validity
2. Outcome validity
3. Reporting completeness
4. Reproducibility
5. Tool effectiveness
6. Statistical validity

Without this, benchmark claims become fragile very quickly.

## What This Means

The current signal is not “MCP always wins.”
The signal is:
- MCP has measurable value, especially in cross-repo and context-heavy discovery tasks.
- The effect is heterogeneous across task families.
- Retrieval quality improvements do not always map linearly to reward outcomes.

That is exactly why this benchmark is structured around SDLC and org-use-case slices instead of a single aggregate score.

## What’s Next

Planned next steps:
1. Expand multi-run coverage to reduce non-determinism noise.
2. Evaluate additional harnesses (Codex, Cursor, Gemini, Copilot, OpenHands).
3. Compare alternate MCP providers on the same task set.
4. Run tool-policy experiments (especially semantic/deep-search nudges).
5. Continue tightening verifier and QA infrastructure before final white paper publication.