Commit 7f3c426

Rename technical report file and align blog/report findings
1 parent 3106f0d commit 7f3c426

File tree

2 files changed: +7 −7 lines

docs/BLOG_POST.md

Lines changed: 5 additions & 5 deletions
@@ -19,7 +19,7 @@ I wanted to evaluate how coding agents perform in as close to an enterprise envi
Anyway it took longer than I thought it would, but I did it. I made a real benchmark that's useful for me and hopefully others too. CodeScaleBench is a living benchmark (this is code for I'm still working on it and am vulnerable to scope creep) that is divided into two parts. CodeScaleBench-SDLC has 150 software engineering tasks spanning the full SDLC; it uses a patch based verifier method popularized by SWE-Bench and also has a corresponding ground_truth.json file produced by a curator agent for context retrieval metrics I'll talk about later. CodeScaleBench-Org has 220 software engineering tasks that are separated into development tasks that require organization and in many cases cross repository-wide codebase navigation and understanding; it uses what I call an 'artifact' verifier where it produces an 'answer.json' file that is compared with the curator agent's solution. I built the benchmark framework, the evaluation pipeline, the ground truth system, and the statistical analysis layer using Claude Code et al. across ~1000 conversation sessions over about a month.

-Some initial findings (that'll be expanded on later): on the current analysis snapshot (generated March 3, 2026), metrics are computed by averaging multiple runs per task/config first, then pairing those per-task means. That yields a paired reward delta of **+0.036** for CSB-SDLC (95% CI: [-0.008, +0.084]), **+0.034** for CSB-Org (95% CI: [+0.013, +0.057]), and **+0.035** overall (95% CI: [+0.013, +0.058]) across **370** baseline/MCP task pairs, with reward-delta variance **0.048985**. The highest positive suite deltas are `csb_org_incident` (+0.113), `csb_org_security` (+0.106), `csb_sdlc_understand` (+0.115), and `csb_sdlc_refactor` (+0.103). Timing and cost no longer show the earlier contradictory pattern: MCP is faster on wall-clock on average (367.11s baseline vs 330.89s MCP, −36.22s) and much faster on agent execution (−101.06s), but slightly more expensive (+$0.040/task, +13.49% on means).
+Some initial findings (that'll be expanded on later): on the current analysis set (generated March 3, 2026), metrics are computed by averaging multiple runs per task/config first, then pairing those per-task means. That yields a paired reward delta of **+0.036** for CSB-SDLC (95% CI: [-0.008, +0.084]), **+0.034** for CSB-Org (95% CI: [+0.013, +0.057]), and **+0.035** overall (95% CI: [+0.013, +0.058]) across **370** baseline/MCP task pairs, with reward-delta variance **0.048985**. The highest positive suite deltas are `csb_org_incident` (+0.113), `csb_org_security` (+0.106), `csb_sdlc_understand` (+0.115), and `csb_sdlc_refactor` (+0.103). MCP is also faster on wall-clock on average (367.11s baseline vs 330.89s MCP, −36.22s), much faster on agent execution (−101.06s), and in the canonical haiku cost estimate reduces average cost per task from **$0.733** to **$0.512** (**−30.16%**).

And by the way, building a benchmark for coding agents while using coding agents is a fun way to find new failure modes. We all know agents are sneaky and mysterious genies, and that's also why I think benchmark results should ship with full agent transcripts for auditing (talking about that later, I know I'm asking a lot of you but I promise if you like benchmarks this is interesting and also explains why you read this far).
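The averaging-then-pairing order described in the changed paragraph can be sketched roughly as follows. The run-record shape, config names (`"baseline"`/`"mcp"`), and the percentile-bootstrap CI are my assumptions for illustration, not the benchmark's actual schema or statistical code.

```python
import random
from collections import defaultdict

def paired_reward_delta(runs, n_boot=10_000, seed=0):
    """Paired MCP-vs-baseline reward delta with a bootstrap 95% CI.

    `runs` is a list of (task_id, config, reward) tuples -- a guessed
    shape, not the benchmark's real schema. Rewards are averaged per
    task/config FIRST, then the per-task means are paired, matching
    the order described in the post.
    """
    by_task = defaultdict(lambda: defaultdict(list))
    for task_id, config, reward in runs:
        by_task[task_id][config].append(reward)

    deltas = []
    for cfgs in by_task.values():
        if "baseline" in cfgs and "mcp" in cfgs:
            base = sum(cfgs["baseline"]) / len(cfgs["baseline"])
            mcp = sum(cfgs["mcp"]) / len(cfgs["mcp"])
            deltas.append(mcp - base)

    mean = sum(deltas) / len(deltas)
    # Percentile bootstrap over task pairs for the 95% interval.
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        sample = [rng.choice(deltas) for _ in deltas]
        boots.append(sum(sample) / len(sample))
    boots.sort()
    return mean, (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)])
```

Pairing per-task means (rather than pooling all runs) is what makes the reported CIs paired statistics: each task serves as its own control, which absorbs between-task reward variance.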

@@ -35,15 +35,15 @@ The same agent (starting with Claude Code + Haiku 4.5) runs the same task under
This makes it a conservative test. In real enterprise settings, the agent wouldn't have full local access to every relevant repo or the entire tens of millions of lines of a monolithic monster. But these runs of the benchmark test whether differences in context retrieval approaches, with access to the same information, change SDLC task outcomes. A future post will cover tasks that are uniquely enabled by these tools that a baseline agent just can't do at all. Though I did also find examples where local tools were insufficient even with all of the local code available, and the tasks were only possible with these retrieval tools. Like agents without these tools getting lost in massive codebases like Kubernetes, or confused about refactoring in Java repos, etc.

-CSB-SDLC tasks are organized by SDLC phase (Understand, Design, Feature, Fix, Test, Document, Refactor, Secure, Debug), and the CSB-Org tasks are organized into organizational use cases (Dependency Tracing, Vulnerability Remediation, Framework Migration, Incident Debugging, Onboarding & Comprehension, Compliance, Cross-Org Discovery, Domain Lineage, Organizational Context, Platform Knowledge, and Cross-Repo Discovery) with many tasks including 3-20 repos. They span 40+ repositories (Kubernetes, Django, Linux, VSCode, etc.) and 9 programming languages. The full methodology, evaluation layers, and information retrieval analysis pipeline are documented in a [draft technical report](technical_reports/TECHNICAL_REPORT_V2.md).
+CSB-SDLC tasks are organized by SDLC phase (Understand, Design, Feature, Fix, Test, Document, Refactor, Secure, Debug), and the CSB-Org tasks are organized into organizational use cases (Dependency Tracing, Vulnerability Remediation, Framework Migration, Incident Debugging, Onboarding & Comprehension, Compliance, Cross-Org Discovery, Domain Lineage, Organizational Context, Platform Knowledge, and Cross-Repo Discovery) with many tasks including 3-20 repos. They span 40+ repositories (Kubernetes, Django, Linux, VSCode, etc.) and 9 programming languages. The full methodology, evaluation layers, and information retrieval analysis pipeline are documented in the [technical report](technical_reports/TECHNICAL_REPORT.md).

## What I Used (And What I Threw Out)

One of the first things I tried to figure out was which existing benchmarks to draw from and which to ignore entirely. I'm not looking to reinvent any wheels if I can avoid it, and if there are existing tasks out there that I can Frankenstein-patch together into some hideous benchmark then I want to find them! I selected, or mostly didn't select, from a variety of benchmarks I found listed in the table below (these are the ones I had shortlisted as most likely to contain steal-worthy candidates).

Side note: I also learned about the ContextBench benchmark fairly late and am not including it here (like, a few days ago, because it went live on arxiv in mid-Feb which was after my research frenzy phase and during my build and write phase), but that benchmark is largely complementary to my investigation. It's a collection of ~1000+ human-annotated context files for SWE-Bench Verified and includes an information retrieval metrics evaluation framework (I'll include a short section on how I used this info to support my benchmark and some results using their evaluation framework).

-Most of CSB-SDLC and all of CSB-Org's tasks are original in the sense that they weren't lifted from an existing benchmark. However, each one is grounded in a real repository at a pinned commit, targeting a real development scenario pulled from GitHub issues, PRs, and codebase analysis. I designed the Org tasks using a custom use case registry and artifact evaluation setup for cross-repository code intelligence; check out the [technical report](technical_reports/TECHNICAL_REPORT_V2.md) for more details on the 'direct' SWE-bench style verifier mode for code modifications vs an 'artifact' answer.json approach.
+Most of CSB-SDLC and all of CSB-Org's tasks are original in the sense that they weren't lifted from an existing benchmark. However, each one is grounded in a real repository at a pinned commit, targeting a real development scenario pulled from GitHub issues, PRs, and codebase analysis. I designed the Org tasks using a custom use case registry and artifact evaluation setup for cross-repository code intelligence; check out the [technical report](technical_reports/TECHNICAL_REPORT.md) for more details on the 'direct' SWE-bench style verifier mode for code modifications vs an 'artifact' answer.json approach.

I also created an agentic benchmark checklist pipeline (inspired by this paper) to audit every task before it goes into a suite. It runs automated checks across three dimensions, Task Validity, Outcome Validity, and Reporting, and flags issues as PASS/FAIL/WARN/SKIP with severity-aware grading (A-F) based on critical and important criteria. It catches many structural and verifier-quality problems; it's complementary to a separate preflight runtime validation check I put in place in my (semi-futile) attempts to eliminate all failure modes (more on that in the QA section).
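As a rough sketch of how an 'artifact' verifier in the style the changed paragraph describes might work: grade the agent-produced answer.json against the curator agent's solution. The flat key/value comparison and partial-credit scoring here are my assumptions; the post doesn't show the real answer.json schema or scoring rules.

```python
import json

def score_artifact(agent_path: str, truth_path: str) -> float:
    """Toy 'artifact' verifier: fraction of curator-solution fields
    the agent's answer.json matches exactly.

    Assumes both files are flat JSON objects -- an illustrative
    simplification, not the benchmark's actual schema.
    """
    with open(agent_path) as f:
        agent = json.load(f)
    with open(truth_path) as f:
        truth = json.load(f)

    # Partial credit: one point per ground-truth field reproduced exactly.
    matched = sum(1 for key, value in truth.items() if agent.get(key) == value)
    return matched / len(truth) if truth else 0.0
```

Unlike a patch-based verifier, nothing here executes code or tests; correctness hinges entirely on how well the curator's answer.json captures the task's ground truth, which is presumably why the curated solutions matter so much.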

@@ -114,7 +114,7 @@ In the refreshed data, the negative SDLC suites are Secure (-0.071), Debug (-0.0
Secure is currently the clearest negative signal in SDLC, followed by smaller negative deltas in debug and test.

-Context retrieval isn't the bottleneck for every software development situation. Codebase size, harness, language, task type, prompt content all contribute. The [technical report](technical_reports/TECHNICAL_REPORT_V2.md) covers the full per-suite breakdown.
+Context retrieval isn't the bottleneck for every software development situation. Codebase size, harness, language, task type, prompt content all contribute. The [technical report](technical_reports/TECHNICAL_REPORT.md) covers the full per-suite breakdown.

## MCP Value Scales With Codebase Size

@@ -275,7 +275,7 @@ Here's what the data from my benchmark says so far:
**Agents really don't want to use semantic / asynchronous tools.** I found that the agent overwhelmingly wanted to use keyword search and ignored Deep Search as a tool (6 tasks, 8 calls out of 602 MCP runs), and I wonder if nudging it to use more optimized tools in different scenarios would change any outcomes.

-The [technical report](technical_reports/TECHNICAL_REPORT_V2.md) has the full methodology, statistical analysis, and evaluation pipeline details. My paper will have even more.
+The [technical report](technical_reports/TECHNICAL_REPORT.md) has the full methodology, statistical analysis, and evaluation pipeline details. My paper will have even more.

## What's Next

docs/technical_reports/TECHNICAL_REPORT_V2.md → docs/technical_reports/TECHNICAL_REPORT.md (renamed)

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
# CodeScaleBench: A Systematic Evaluation Framework for Assessing the Impact of Enhanced Code Intelligence on AI Coding Agent Performance

-**White Paper Technical Report -- V2**
+**White Paper Technical Report**
**Date:** March 3, 2026
-**Revision:** V2
+**Last Modified:** March 5, 2026

---
