docs/BLOG_POST.md (3 additions & 3 deletions)
@@ -19,7 +19,7 @@ I wanted to evaluate how coding agents perform in as close to an enterprise envi
Anyway, it took longer than I thought it would, but I did it. I made a real benchmark that's useful for me and hopefully others too. CodeScaleBench is a living benchmark (this is code for: I'm still working on it and am vulnerable to scope creep) that is divided into two parts. CodeScaleBench-SDLC has 150 software engineering tasks spanning the full SDLC; it uses the patch-based verifier method popularized by SWE-Bench and also has a corresponding ground_truth.json file, produced by a curator agent, for the context-retrieval metrics I'll talk about later. CodeScaleBench-Org has 220 software engineering tasks: development tasks that require organization-wide and, in many cases, cross-repository codebase navigation and understanding. It uses what I call an 'artifact' verifier: the agent produces an 'answer.json' file that is compared with the curator agent's solution. I built the benchmark framework, the evaluation pipeline, the ground truth system, and the statistical analysis layer using Claude Code et al. across ~1000 conversation sessions over about a month.
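The 'artifact' verifier idea above can be sketched as a small comparison function. This is a minimal illustration, not the benchmark's actual code: the field names, the answer.json schema, and the field-matching scoring rule are all assumptions for the example.

```python
import json

def score_artifact(answer_path: str, truth_path: str) -> float:
    """Hypothetical sketch of an 'artifact' verifier: load the agent's
    answer.json and the curator agent's reference solution, then score
    the fraction of expected fields the agent answered correctly.
    (Schema and scoring rule are illustrative assumptions, not the
    benchmark's real implementation.)"""
    with open(answer_path) as f:
        answer = json.load(f)
    with open(truth_path) as f:
        truth = json.load(f)
    if not truth:
        return 0.0
    # Count reference fields the agent's answer matches exactly.
    matched = sum(1 for key, value in truth.items() if answer.get(key) == value)
    return matched / len(truth)
```

A real verifier would likely need per-field tolerances (e.g. normalized paths, case-insensitive identifiers) rather than strict equality, but the exact-match form keeps the shape of the idea clear.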
-
Some initial findings (that'll be expanded on later): on the current analysis snapshot (generated March 3, 2026), metrics are computed by averaging multiple runs per task/config first, then pairing those per-task means. That yields a paired reward delta of **+0.036** for CSB-SDLC, **+0.034** for CSB-Org, and **+0.035** overall across **370** baseline/MCP task pairs, with reward-delta variance **0.048985**. The highest positive suite deltas are `csb_org_incident` (+0.113), `csb_org_security` (+0.106), `csb_sdlc_understand` (+0.115), and `csb_sdlc_refactor` (+0.103). Timing and cost no longer show the earlier contradictory pattern: MCP is faster on wall-clock on average (367.11s baseline vs 330.89s MCP, −36.22s) and much faster on agent execution (−101.06s), but slightly more expensive (+$0.040/task, +13.49% on means).
+
Some initial findings (that'll be expanded on later): on the current analysis snapshot (generated March 3, 2026), metrics are computed by averaging multiple runs per task/config first, then pairing those per-task means. That yields a paired reward delta of **+0.036** for CSB-SDLC (95% CI: [-0.008, +0.084]), **+0.034** for CSB-Org (95% CI: [+0.013, +0.057]), and **+0.035** overall (95% CI: [+0.013, +0.058]) across **370** baseline/MCP task pairs, with reward-delta variance **0.048985**. The highest positive suite deltas are `csb_org_incident` (+0.113), `csb_org_security` (+0.106), `csb_sdlc_understand` (+0.115), and `csb_sdlc_refactor` (+0.103). Timing and cost no longer show the earlier contradictory pattern: MCP is faster on wall-clock on average (367.11s baseline vs 330.89s MCP, −36.22s) and much faster on agent execution (−101.06s), but slightly more expensive (+$0.040/task, +13.49% on means).
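The aggregation described above (average runs per task/config first, then pair per-task means) can be sketched in a few lines. This is an illustrative reconstruction of the stated procedure, not the actual analysis code; the `runs` data shape and function name are assumptions.

```python
def paired_reward_delta(runs):
    """Sketch of the aggregation described in the post: `runs` maps
    (task_id, config) -> list of per-run rewards, with config either
    'baseline' or 'mcp'. Average the runs for each (task, config)
    first, then pair per-task means and take MCP-minus-baseline
    deltas. (Data shape is an assumption for illustration.)"""
    # Step 1: per-(task, config) mean over repeated runs.
    means = {key: sum(rewards) / len(rewards) for key, rewards in runs.items()}
    # Step 2: pair per-task means and compute deltas.
    tasks = sorted({task for task, _ in runs})
    deltas = [
        means[(task, "mcp")] - means[(task, "baseline")]
        for task in tasks
        if (task, "mcp") in means and (task, "baseline") in means
    ]
    return sum(deltas) / len(deltas), deltas
```

Averaging runs first matters: it keeps each task contributing one paired observation regardless of how many times it was rerun, so noisy tasks with extra runs don't get extra weight in the headline delta.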
And by the way, building a benchmark for coding agents while using coding agents is a fun way to find new failure modes. We all know agents are sneaky and mysterious genies, and that's also why I think benchmark results should ship with full agent transcripts for auditing (more on that later; I know I'm asking a lot of you, but I promise that if you like benchmarks this is interesting, and it also explains why you read this far).
@@ -69,7 +69,7 @@ Breaking it down by SDLC element (which is how I designed this side of the bench
| debug | 18 | -0.019 |
| secure | 7 | **-0.071** |
-
SDLC total: mean paired delta **+0.036** (n=150 paired tasks).
+
SDLC total: mean paired delta **+0.036** (n=150 paired tasks, 95% CI: [-0.008, +0.084]).
@@ -89,7 +89,7 @@ From that table above, the largest SDLC gains in this snapshot are design (+0.14
| platform | 18 | -0.015 |
| crossrepo | 14 | -0.027 |
-
Org total: mean paired delta **+0.034** (n=220 paired tasks). When the agent needs to find information scattered across multiple repos, MCP tools still help overall in this snapshot.
+
Org total: mean paired delta **+0.034** (n=220 paired tasks, 95% CI: [+0.013, +0.057]). When the agent needs to find information scattered across multiple repos, MCP tools still help overall in this snapshot.
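The post doesn't say how the 95% CIs on these paired deltas were computed; one common choice for this kind of per-task paired data is a percentile bootstrap over tasks, sketched here under that assumption.

```python
import random

def bootstrap_ci(deltas, iters=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI on the mean of paired per-task deltas.
    This is an assumed method for illustration; the post does not
    specify how its confidence intervals were actually derived."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(deltas)
    # Resample tasks with replacement, recording each resample's mean.
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(iters)
    )
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi
```

Bootstrapping over tasks (rather than over individual runs) matches the pairing scheme above: each task contributes one delta, so the CI reflects task-to-task variation.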
The biggest effects are on incident debugging (+0.113) and security (+0.106). These are the tasks that look most like real enterprise work: tracing a vulnerability across a dozen repos, mapping error paths across microservices, figuring out mysterious (haunted?) codebases.