Commit 1f61178

Add bootstrap confidence intervals to refreshed analysis
1 parent c0ed534 commit 1f61178

File tree

5 files changed: +600 −36 lines

docs/BLOG_POST.md

Lines changed: 3 additions & 3 deletions

@@ -19,7 +19,7 @@ I wanted to evaluate how coding agents perform in as close to an enterprise envi
 
 Anyway it took longer than I thought it would, but I did it. I made a real benchmark that's useful for me and hopefully others too. CodeScaleBench is a living benchmark (this is code for: I'm still working on it and am vulnerable to scope creep) that is divided into two parts. CodeScaleBench-SDLC has 150 software engineering tasks spanning the full SDLC; it uses the patch-based verifier method popularized by SWE-Bench and also has a corresponding ground_truth.json file, produced by a curator agent, for the context-retrieval metrics I'll talk about later. CodeScaleBench-Org has 220 software engineering tasks that require organization-wide, and in many cases cross-repository, codebase navigation and understanding; it uses what I call an 'artifact' verifier, where the agent produces an 'answer.json' file that is compared with the curator agent's solution. I built the benchmark framework, the evaluation pipeline, the ground-truth system, and the statistical analysis layer using Claude Code et al. across ~1000 conversation sessions over about a month.
 
-Some initial findings (that'll be expanded on later): on the current analysis snapshot (generated March 3, 2026), metrics are computed by averaging multiple runs per task/config first, then pairing those per-task means. That yields a paired reward delta of **+0.036** for CSB-SDLC, **+0.034** for CSB-Org, and **+0.035** overall across **370** baseline/MCP task pairs, with reward-delta variance **0.048985**. The highest positive suite deltas are `csb_org_incident` (+0.113), `csb_org_security` (+0.106), `csb_sdlc_understand` (+0.115), and `csb_sdlc_refactor` (+0.103). Timing and cost no longer show the earlier contradictory pattern: MCP is faster on wall-clock on average (367.11s baseline vs 330.89s MCP, −36.22s) and much faster on agent execution (−101.06s), but slightly more expensive (+$0.040/task, +13.49% on means).
+Some initial findings (that'll be expanded on later): on the current analysis snapshot (generated March 3, 2026), metrics are computed by averaging multiple runs per task/config first, then pairing those per-task means. That yields a paired reward delta of **+0.036** for CSB-SDLC (95% CI: [-0.008, +0.084]), **+0.034** for CSB-Org (95% CI: [+0.013, +0.057]), and **+0.035** overall (95% CI: [+0.013, +0.058]) across **370** baseline/MCP task pairs, with reward-delta variance **0.048985**. The highest positive suite deltas are `csb_org_incident` (+0.113), `csb_org_security` (+0.106), `csb_sdlc_understand` (+0.115), and `csb_sdlc_refactor` (+0.103). Timing and cost no longer show the earlier contradictory pattern: MCP is faster on wall-clock on average (367.11s baseline vs 330.89s MCP, −36.22s) and much faster on agent execution (−101.06s), but slightly more expensive (+$0.040/task, +13.49% on means).
 
 And by the way, building a benchmark for coding agents while using coding agents is a fun way to find new failure modes. We all know agents are sneaky and mysterious genies, and that's also why I think benchmark results should ship with full agent transcripts for auditing (more on that later; I know I'm asking a lot of you, but I promise that if you like benchmarks this is interesting, and it also explains why you read this far).
 
@@ -69,7 +69,7 @@ Breaking it down by SDLC element (which is how I designed this side of the bench
 | debug | 18 | -0.019 |
 | secure | 7 | **-0.071** |
 
-SDLC total: mean paired delta **+0.036** (n=150 paired tasks).
+SDLC total: mean paired delta **+0.036** (n=150 paired tasks, 95% CI: [-0.008, +0.084]).
 
 ## Where Sourcegraph MCP Wins
 
@@ -89,7 +89,7 @@ From that table above, the largest SDLC gains in this snapshot are design (+0.14
 | platform | 18 | -0.015 |
 | crossrepo | 14 | -0.027 |
 
-Org total: mean paired delta **+0.034** (n=220 paired tasks). When the agent needs to find information scattered across multiple repos, MCP tools still help overall in this snapshot.
+Org total: mean paired delta **+0.034** (n=220 paired tasks, 95% CI: [+0.013, +0.057]). When the agent needs to find information scattered across multiple repos, MCP tools still help overall in this snapshot.
 
 The biggest effects are on security (+0.113) and incident debugging (+0.108). These are the tasks that look most like real enterprise work: tracing a vulnerability across a dozen repos, mapping error paths across microservices, figuring out mysterious (haunted?) codebases.
 
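The pipeline described in the post (average multiple runs per task/config first, then pair the per-task means) can be sketched roughly as follows. The record layout and config names here are illustrative assumptions, not the benchmark's actual schema:

```python
from collections import defaultdict

def paired_deltas(runs):
    """runs: iterable of (task_id, config, reward) tuples, with config in
    {'baseline', 'mcp'}. Average rewards per (task, config) first, then
    pair the per-task means and return mcp-minus-baseline deltas."""
    sums = defaultdict(lambda: [0.0, 0])
    for task, config, reward in runs:
        acc = sums[(task, config)]
        acc[0] += reward   # running total of rewards for this (task, config)
        acc[1] += 1        # run count for this (task, config)
    means = {key: total / count for key, (total, count) in sums.items()}
    tasks = {task for task, _ in means}
    # Only tasks that have a mean under both configs form a pair.
    return {
        t: means[(t, "mcp")] - means[(t, "baseline")]
        for t in tasks
        if (t, "mcp") in means and (t, "baseline") in means
    }
```

Averaging within a task before differencing keeps tasks with many reruns from dominating the overall delta, which is presumably why the per-task pairing is done first.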
docs/analysis/analysis_refresh_tables_20260303.json

Lines changed: 188 additions & 0 deletions

@@ -607,5 +607,193 @@
       "delta_cost_mean_usd": 0.0956,
       "delta_cost_variance": 0.022037
     }
+  },
+  "bootstrap_cis": {
+    "overall": {
+      "n": 370,
+      "mean": 0.03488056628056628,
+      "ci95": [
+        0.012977608322608318,
+        0.05788240883740884
+      ]
+    },
+    "sdlc": {
+      "n": 150,
+      "mean": 0.036298952380952376,
+      "ci95": [
+        -0.008323354497354503,
+        0.08350213756613756
+      ]
+    },
+    "org": {
+      "n": 220,
+      "mean": 0.033913484848484846,
+      "ci95": [
+        0.01331424242424242,
+        0.05709583333333334
+      ]
+    },
+    "by_suite": {
+      "csb_org_compliance": {
+        "n": 18,
+        "mean": 0.015322222222222221,
+        "ci95": [
+          -0.03528888888888889,
+          0.07069444444444442
+        ]
+      },
+      "csb_org_crossorg": {
+        "n": 15,
+        "mean": 0.02519333333333334,
+        "ci95": [
+          -0.009408888888888899,
+          0.057195555555555555
+        ]
+      },
+      "csb_org_crossrepo": {
+        "n": 14,
+        "mean": -0.02416428571428573,
+        "ci95": [
+          -0.05576904761904764,
+          0.007314285714285689
+        ]
+      },
+      "csb_org_crossrepo_tracing": {
+        "n": 22,
+        "mean": 0.0514,
+        "ci95": [
+          -0.004018181818181809,
+          0.1406939393939394
+        ]
+      },
+      "csb_org_domain": {
+        "n": 20,
+        "mean": -0.01652333333333334,
+        "ci95": [
+          -0.05229166666666668,
+          0.020853333333333324
+        ]
+      },
+      "csb_org_incident": {
+        "n": 20,
+        "mean": 0.11251,
+        "ci95": [
+          0.02462166666666666,
+          0.23049833333333333
+        ]
+      },
+      "csb_org_migration": {
+        "n": 26,
+        "mean": 0.03808782051282053,
+        "ci95": [
+          -0.008658333333333317,
+          0.10059038461538462
+        ]
+      },
+      "csb_org_onboarding": {
+        "n": 28,
+        "mean": 0.008305357142857146,
+        "ci95": [
+          -0.050334523809523815,
+          0.07817321428571429
+        ]
+      },
+      "csb_org_org": {
+        "n": 15,
+        "mean": 0.0568088888888889,
+        "ci95": [
+          -0.00670888888888887,
+          0.11804444444444445
+        ]
+      },
+      "csb_org_platform": {
+        "n": 18,
+        "mean": -0.02871481481481483,
+        "ci95": [
+          -0.08001851851851854,
+          0.011316666666666654
+        ]
+      },
+      "csb_org_security": {
+        "n": 24,
+        "mean": 0.10570555555555557,
+        "ci95": [
+          0.02496736111111111,
+          0.2102277777777778
+        ]
+      },
+      "csb_sdlc_debug": {
+        "n": 18,
+        "mean": -0.037222222222222226,
+        "ci95": [
+          -0.09092592592592592,
+          0.02999999999999999
+        ]
+      },
+      "csb_sdlc_design": {
+        "n": 14,
+        "mean": 0.05137290249433107,
+        "ci95": [
+          -0.1050613378684807,
+          0.21310861678004533
+        ]
+      },
+      "csb_sdlc_document": {
+        "n": 13,
+        "mean": 0.041538461538461524,
+        "ci95": [
+          -0.003846153846153858,
+          0.08999999999999998
+        ]
+      },
+      "csb_sdlc_feature": {
+        "n": 23,
+        "mean": 0.013019323671497577,
+        "ci95": [
+          -0.1134057971014493,
+          0.16041062801932365
+        ]
+      },
+      "csb_sdlc_fix": {
+        "n": 26,
+        "mean": 0.0986363247863248,
+        "ci95": [
+          0.016485470085470084,
+          0.19674786324786323
+        ]
+      },
+      "csb_sdlc_refactor": {
+        "n": 16,
+        "mean": 0.10291666666666664,
+        "ci95": [
+          -0.15187500000000004,
+          0.34249999999999997
+        ]
+      },
+      "csb_sdlc_secure": {
+        "n": 12,
+        "mean": -0.05000000000000001,
+        "ci95": [
+          -0.11666666666666668,
+          0.010416666666666663
+        ]
+      },
+      "csb_sdlc_test": {
+        "n": 18,
+        "mean": -0.011296296296296306,
+        "ci95": [
+          -0.09722222222222227,
+          0.08314814814814814
+        ]
+      },
+      "csb_sdlc_understand": {
+        "n": 10,
+        "mean": 0.11483000000000002,
+        "ci95": [
+          -0.0415,
+          0.31466000000000005
+        ]
+      }
+    }
   }
 }
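The commit doesn't show how the `ci95` intervals above were computed, but a plain percentile bootstrap over the per-task paired deltas would produce output of this shape. A minimal sketch, assuming percentile bootstrap on the mean (the resample count and seed are arbitrary choices, not taken from the repo):

```python
import random

def bootstrap_ci95(deltas, n_resamples=10_000, seed=0):
    """Percentile-bootstrap 95% CI for the mean of per-task paired deltas.

    Resamples the deltas with replacement, computes each resample's mean,
    and reads the 2.5th and 97.5th percentiles off the sorted means.
    """
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(rng.choices(deltas, k=n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples)]
    return sum(deltas) / n, (lo, hi)
```

Note how this matches the JSON's qualitative story: wide suites like `csb_sdlc_refactor` (n=16) get intervals spanning zero, while `overall` (n=370) gets a tight, strictly positive interval.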

0 commit comments
