Commit 20b2411

sjarmak and claude committed
docs: refresh benchmark catalog docs and add technical report
- README: update MCP-unique section (6 suites/12 tasks -> 11 suites/81 tasks), total 251, repo structure reflects all 11 ccb_mcp_* suites
- benchmarks/README: replace 3 removed protonmail tasks with new envoy and terraform fix tasks
- TASK_CATALOG: same protonmail -> envoy/terraform replacement
- Dockerfiles: clone-as-claude migration for 84 MCP-unique baselines
- Promote fix_haiku_20260226_new3tasks to official

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 7cc0246 commit 20b2411

88 files changed: +957 -695 lines changed

README.md

Lines changed: 26 additions & 16 deletions
@@ -81,19 +81,24 @@ Eight suites organized by software development lifecycle phase:

 ## MCP-Unique Suites (Org-Scale Context Retrieval)

-Six additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
+Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.

 | Suite | Category | Tasks | Description |
 |-------|----------|------:|-------------|
-| `ccb_mcp_crossrepo_tracing` | A: Dependency Tracing | 3 | Cross-repo dependency chains, blast radius, symbol resolution |
-| `ccb_mcp_security` | B: Vulnerability Remediation | 2 | CVE mapping, missing auth middleware across repos |
-| `ccb_mcp_incident` | D: Incident Debugging | 1 | Error-to-code-path tracing across microservices |
-| `ccb_mcp_onboarding` | E: Onboarding & Comprehension | 3 | API consumption mapping, end-to-end flow, architecture maps |
-| `ccb_mcp_crossorg` | G: Cross-Org Discovery | 2 | Interface implementations and authoritative repo identification across orgs |
-| `ccb_mcp_platform` | J: Platform Knowledge | 1 | Service template discovery and tribal knowledge |
-| **Total** | | **12** | |
-
-The table above shows the 12 tasks evaluated in official runs. The full MCP-unique catalog has 20 tasks across 8 suites (including compliance and migration, pending first runs). **Combined catalog total: 190 tasks** (170 SDLC + 20 MCP-unique).
+| `ccb_mcp_crossrepo_tracing` | A: Dependency Tracing | 9 | Cross-repo dependency chains, blast radius, symbol resolution |
+| `ccb_mcp_security` | B: Vulnerability Remediation | 10 | CVE mapping, missing auth middleware across repos |
+| `ccb_mcp_migration` | C: Framework Migration | 7 | API migrations, breaking changes across repos |
+| `ccb_mcp_incident` | D: Incident Debugging | 11 | Error-to-code-path tracing across microservices |
+| `ccb_mcp_onboarding` | E: Onboarding & Comprehension | 11 | API consumption mapping, end-to-end flow, architecture maps |
+| `ccb_mcp_compliance` | F: Compliance | 7 | Standards adherence, audit, and provenance workflows |
+| `ccb_mcp_crossorg` | G: Cross-Org Discovery | 5 | Interface implementations and authoritative repo identification across orgs |
+| `ccb_mcp_domain` | H: Domain Lineage | 10 | Config propagation, architecture patterns, domain analysis |
+| `ccb_mcp_org` | I: Organizational Context | 5 | Agentic discovery, org-wide coding correctness |
+| `ccb_mcp_platform` | J: Platform Knowledge | 5 | Service template discovery and tribal knowledge |
+| `ccb_mcp_crossrepo` | Legacy | 1 | Cross-repo discovery (compatibility) |
+| **Total** | | **81** | |
+
+**Combined catalog total: 251 tasks** (170 SDLC + 81 MCP-unique). Of these, 212 are fully paired (baseline + MCP results) in official runs; the remaining 39 MCP-unique tasks have MCP results but are missing baselines.

 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.

@@ -133,12 +138,17 @@ benchmarks/ # Task definitions organized by SDLC phase + MCP-unique
   ccb_secure/ # Security & Compliance (20 tasks)
   ccb_test/ # Testing & QA (20 tasks)
   ccb_understand/ # Requirements & Discovery (20 tasks)
-  ccb_mcp_crossrepo_tracing/ # MCP-unique: cross-repo dependency tracing (3 tasks)
-  ccb_mcp_security/ # MCP-unique: vulnerability remediation (2 tasks)
-  ccb_mcp_incident/ # MCP-unique: incident debugging (1 task)
-  ccb_mcp_onboarding/ # MCP-unique: onboarding & comprehension (3 tasks)
-  ccb_mcp_crossorg/ # MCP-unique: cross-org discovery (2 tasks)
-  ccb_mcp_platform/ # MCP-unique: platform knowledge (1 task)
+  ccb_mcp_compliance/ # MCP-unique: compliance & audit (7 tasks)
+  ccb_mcp_crossorg/ # MCP-unique: cross-org discovery (5 tasks)
+  ccb_mcp_crossrepo/ # MCP-unique: legacy cross-repo (1 task)
+  ccb_mcp_crossrepo_tracing/ # MCP-unique: dependency tracing (9 tasks)
+  ccb_mcp_domain/ # MCP-unique: domain lineage (10 tasks)
+  ccb_mcp_incident/ # MCP-unique: incident debugging (11 tasks)
+  ccb_mcp_migration/ # MCP-unique: framework migration (7 tasks)
+  ccb_mcp_onboarding/ # MCP-unique: onboarding (11 tasks)
+  ccb_mcp_org/ # MCP-unique: org context (5 tasks)
+  ccb_mcp_platform/ # MCP-unique: platform knowledge (5 tasks)
+  ccb_mcp_security/ # MCP-unique: vulnerability remediation (10 tasks)
 configs/ # Run configs and task selection
   _common.sh # Shared infra: token refresh, parallel execution, multi-account
   sdlc_suite_2config.sh # Generic SDLC runner (used by phase wrappers below)
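
The totals stated in the updated README can be cross-checked mechanically. The sketch below is not part of the commit: it copies the per-suite task counts from the added table rows and confirms that they sum to 81, that 170 SDLC tasks plus 81 MCP-unique tasks give the stated 251, and that the 212 paired and 39 unpaired tasks account for the whole catalog. The dictionary and variable names are illustrative assumptions.

```python
# Per-suite task counts copied from the rows added to README.md.
MCP_SUITE_TASKS = {
    "ccb_mcp_crossrepo_tracing": 9,
    "ccb_mcp_security": 10,
    "ccb_mcp_migration": 7,
    "ccb_mcp_incident": 11,
    "ccb_mcp_onboarding": 11,
    "ccb_mcp_compliance": 7,
    "ccb_mcp_crossorg": 5,
    "ccb_mcp_domain": 10,
    "ccb_mcp_org": 5,
    "ccb_mcp_platform": 5,
    "ccb_mcp_crossrepo": 1,
}

SDLC_TASKS = 170        # stated in the README
PAIRED_TASKS = 212      # baseline + MCP results available
UNPAIRED_TASKS = 39     # MCP results only, baselines missing

mcp_total = sum(MCP_SUITE_TASKS.values())
combined_total = SDLC_TASKS + mcp_total

# Each assertion mirrors one claim made in the README text.
assert len(MCP_SUITE_TASKS) == 11
assert mcp_total == 81
assert combined_total == 251
assert PAIRED_TASKS + UNPAIRED_TASKS == combined_total

print(f"{len(MCP_SUITE_TASKS)} suites, {mcp_total} MCP-unique tasks, {combined_total} total")
```

A check like this could run in CI against the real task directories rather than hard-coded counts, so the README table cannot drift from the catalog.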

benchmarks/README.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -126,9 +126,9 @@ Diagnosing and fixing real bugs across production codebases (SWE-bench Pro, PyTo
126126
| `openlibrary-fntocli-adapter-fix-001` | FnToCLI adapter list inputs paths |
127127
| `openlibrary-search-query-fix-001` | Work search query parsing normalization |
128128
| `openlibrary-solr-boolean-fix-001` | Solr boolean clause limit alignment |
129-
| `protonmail-conv-testhooks-fix-001` | Conversation message view test hooks |
130-
| `protonmail-dropdown-sizing-fix-001` | Dropdown unified sizing configuration |
131-
| `protonmail-holiday-calendar-fix-001` | Public holiday calendar management |
129+
| `envoy-dfp-host-leak-fix-001` | Dynamic forward proxy host header memory leak |
130+
| `envoy-udp-proxy-cds-fix-001` | UDP proxy crash on dynamic CDS/EDS cluster update |
131+
| `terraform-plan-null-unknown-fix-001` | Terraform plan null/unknown value rendering |
132132
| `pytorch-cudnn-version-fix-001` | Expose cuDNN runtime version |
133133
| `pytorch-dynamo-keyerror-fix-001` | Fix dynamo keyerror and attribute |
134134
| `pytorch-release-210-fix-001` | Release 2.10 bug fix changes |

benchmarks/ccb_mcp_compliance/ccx-compliance-051/environment/Dockerfile

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -11,22 +11,22 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
1111
\
1212
&& rm -rf /var/lib/apt/lists/*
1313

14+
# Create claude user BEFORE any work so files are owned correctly from the
15+
# start. This avoids a post-clone chown -R layer that doubles image size
16+
# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
17+
RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
18+
RUN mkdir -p /workspace /logs/agent /logs/verifier && \
19+
chown -R claude:claude /workspace /logs
20+
21+
USER claude
1422
WORKDIR /workspace
1523

16-
# Clone local checkout repos (baseline config: agent has local access to these)
17-
# No local checkout repos specified for this fixture
18-
1924
# Initialize git identity for agent commits
2025
RUN git config --global user.email "agent@example.com" && \
2126
git config --global user.name "Agent" && \
2227
git config --global safe.directory '*'
2328

24-
# Create log directories
25-
RUN mkdir -p /logs/agent /logs/verifier
26-
27-
# Pre-create claude user and set ownership at build time so Harbor's
28-
# runtime chown is a no-op (avoids 15-30 min delay on large repos).
29-
RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
30-
for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
29+
# Switch back to root for Harbor's runtime setup
30+
USER root
3131

3232
ENTRYPOINT []
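
The Dockerfile comment's claim that a post-clone `chown -R` layer doubles image size can be illustrated with a deliberately simplified model. This sketch is not from the commit. It assumes that an image's size is the sum of its layers and that a layer which changes a file's ownership stores a full copy of that file. The 1000 MB repository size and all names are hypothetical.

```python
def image_size_mb(layers):
    """Total image size under the simplified model: layers only add up."""
    return sum(size_mb for _, size_mb in layers)


REPO_MB = 1000  # hypothetical size of the cloned fixture repos

# Old pattern: clone as root, then fix ownership in a later layer.
# The chown layer re-stores every file whose owner changed.
chown_after_clone = [
    ("RUN git clone ... (as root)", REPO_MB),
    ("RUN chown -R claude:claude /workspace", REPO_MB),
]

# New pattern: chown the empty directories first, then clone as claude.
clone_as_claude = [
    ("RUN mkdir -p /workspace && chown -R claude:claude /workspace", 0),
    ("RUN git clone ... (as claude)", REPO_MB),
]

print(image_size_mb(chown_after_clone))  # 2000
print(image_size_mb(clone_as_claude))    # 1000
```

Real images also carry base-image and package layers, so the observed ratio would be somewhat below 2x. The model only shows why the ownership fix is cheap when applied before the repositories exist.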

benchmarks/ccb_mcp_compliance/ccx-compliance-052/environment/Dockerfile

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -11,8 +11,16 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
1111
g++ make \
1212
&& rm -rf /var/lib/apt/lists/*
1313

14+
# Create claude user BEFORE cloning so files are owned correctly from the
15+
# start. This avoids a post-clone chown -R layer that doubles image size
16+
# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
17+
RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
18+
RUN mkdir -p /workspace /logs/agent /logs/verifier && \
19+
chown -R claude:claude /workspace /logs
20+
21+
# Clone as claude — files land claude-owned, no separate chown layer needed.
22+
USER claude
1423
WORKDIR /workspace
15-
1624
# Clone local checkout repos (baseline config: agent has local access to these)
1725
RUN git clone --depth 1 https://github.com/sg-evals/envoy--v1.31.2 /workspace/envoy--v1.31.2
1826
RUN git clone --depth 1 https://github.com/sg-evals/data-plane-api--84e84367 /workspace/data-plane-api--84e84367
@@ -24,12 +32,7 @@ RUN git config --global user.email "agent@example.com" && \
2432
git config --global user.name "Agent" && \
2533
git config --global safe.directory '*'
2634

27-
# Create log directories
28-
RUN mkdir -p /logs/agent /logs/verifier
29-
30-
# Pre-create claude user and set ownership at build time so Harbor's
31-
# runtime chown is a no-op (avoids 15-30 min delay on large repos).
32-
RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
33-
for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
35+
# Switch back to root for Harbor's runtime setup
36+
USER root
3437

3538
ENTRYPOINT []
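
Because the same clone-as-claude change is repeated across many Dockerfiles, a small lint could help confirm that each one follows the pattern. The sketch below is an assumption about how such a check might look and is not part of the commit. The function names are hypothetical, and the parser handles only backslash continuations, not the full Dockerfile grammar. It flags any `git clone` that runs as a user other than `claude` and any recursive `chown` that appears after a clone.

```python
def logical_lines(text):
    """Yield Dockerfile instructions with backslash continuations joined."""
    buf = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not buf and (not line or line.startswith("#")):
            continue  # skip blank lines and comments between instructions
        if line.endswith("\\"):
            buf += line[:-1] + " "
            continue
        yield buf + line
        buf = ""
    if buf.strip():
        yield buf.strip()


def check_clone_as_claude(text):
    """Return a list of problems; an empty list means the pattern holds."""
    problems = []
    user = "root"      # Docker's default build user
    cloned = False
    for line in logical_lines(text):
        parts = line.split(None, 1)
        instruction = parts[0].upper()
        args = parts[1] if len(parts) > 1 else ""
        if instruction == "USER":
            user = args.strip()
        elif instruction == "RUN":
            if "git clone" in args:
                cloned = True
                if user != "claude":
                    problems.append(f"git clone runs as {user}")
            if cloned and "chown -R" in args:
                problems.append("recursive chown after clone")
    return problems


# Condensed from the new and old versions shown in the diff above.
NEW_STYLE = r"""
RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
RUN mkdir -p /workspace /logs/agent /logs/verifier && \
    chown -R claude:claude /workspace /logs
USER claude
WORKDIR /workspace
RUN git clone --depth 1 https://github.com/sg-evals/envoy--v1.31.2 /workspace/envoy--v1.31.2
USER root
"""

OLD_STYLE = r"""
WORKDIR /workspace
RUN git clone --depth 1 https://github.com/sg-evals/envoy--v1.31.2 /workspace/envoy--v1.31.2
RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
    for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
"""

print(check_clone_as_claude(NEW_STYLE))
print(check_clone_as_claude(OLD_STYLE))
```

The new-style input produces no findings because its only recursive `chown` runs before anything has been cloned. The old-style input is flagged twice, once for cloning as root and once for the later ownership fix.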

benchmarks/ccb_mcp_compliance/ccx-compliance-053/environment/Dockerfile

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -11,8 +11,16 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
1111
default-jdk \
1212
&& rm -rf /var/lib/apt/lists/*
1313

14+
# Create claude user BEFORE cloning so files are owned correctly from the
15+
# start. This avoids a post-clone chown -R layer that doubles image size
16+
# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
17+
RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
18+
RUN mkdir -p /workspace /logs/agent /logs/verifier && \
19+
chown -R claude:claude /workspace /logs
20+
21+
# Clone as claude — files land claude-owned, no separate chown layer needed.
22+
USER claude
1423
WORKDIR /workspace
15-
1624
# Clone local checkout repos (baseline config: agent has local access to these)
1725
RUN git clone --depth 1 https://github.com/sg-evals/kafka--0753c489 /workspace/kafka--0753c489
1826
RUN git clone --depth 1 https://github.com/sg-evals/flink--0cc95fcc /workspace/flink--0cc95fcc
@@ -23,12 +31,7 @@ RUN git config --global user.email "agent@example.com" && \
2331
git config --global user.name "Agent" && \
2432
git config --global safe.directory '*'
2533

26-
# Create log directories
27-
RUN mkdir -p /logs/agent /logs/verifier
28-
29-
# Pre-create claude user and set ownership at build time so Harbor's
30-
# runtime chown is a no-op (avoids 15-30 min delay on large repos).
31-
RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
32-
for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
34+
# Switch back to root for Harbor's runtime setup
35+
USER root
3336

3437
ENTRYPOINT []

benchmarks/ccb_mcp_compliance/ccx-compliance-057-ds/environment/Dockerfile

Lines changed: 11 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -10,8 +10,16 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
1010
python3 \
1111
&& rm -rf /var/lib/apt/lists/*
1212

13-
WORKDIR /workspace
13+
# Create claude user BEFORE cloning so files are owned correctly from the
14+
# start. This avoids a post-clone chown -R layer that doubles image size
15+
# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
16+
RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
17+
RUN mkdir -p /workspace /logs/agent /logs/verifier && \
18+
chown -R claude:claude /workspace /logs
1419

20+
# Clone as claude — files land claude-owned, no separate chown layer needed.
21+
USER claude
22+
WORKDIR /workspace
1523
# Clone all fixture repos (baseline has full local access to every repo)
1624
RUN git clone --depth 1 https://github.com/sg-evals/grafana--v11.4.0.git /workspace/grafana
1725
RUN git clone --depth 1 https://github.com/sg-evals/loki--v3.3.4.git /workspace/loki
@@ -21,7 +29,7 @@ RUN git config --global user.email "agent@example.com" && \
2129
git config --global user.name "Agent" && \
2230
git config --global safe.directory '*'
2331

24-
# Create log directories
25-
RUN mkdir -p /logs/agent /logs/verifier
32+
# Switch back to root for Harbor's runtime setup
33+
USER root
2634

2735
ENTRYPOINT []

benchmarks/ccb_mcp_compliance/ccx-compliance-057/environment/Dockerfile

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -11,8 +11,16 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
1111
golang-go \
1212
&& rm -rf /var/lib/apt/lists/*
1313

14+
# Create claude user BEFORE cloning so files are owned correctly from the
15+
# start. This avoids a post-clone chown -R layer that doubles image size
16+
# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
17+
RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
18+
RUN mkdir -p /workspace /logs/agent /logs/verifier && \
19+
chown -R claude:claude /workspace /logs
20+
21+
# Clone as claude — files land claude-owned, no separate chown layer needed.
22+
USER claude
1423
WORKDIR /workspace
15-
1624
# Clone local checkout repos (baseline config: agent has local access to these)
1725
RUN git clone --depth 1 https://github.com/sg-evals/grafana--26d36ec /workspace/grafana--26d36ec
1826

@@ -21,12 +29,7 @@ RUN git config --global user.email "agent@example.com" && \
2129
git config --global user.name "Agent" && \
2230
git config --global safe.directory '*'
2331

24-
# Create log directories
25-
RUN mkdir -p /logs/agent /logs/verifier
26-
27-
# Pre-create claude user and set ownership at build time so Harbor's
28-
# runtime chown is a no-op (avoids 15-30 min delay on large repos).
29-
RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
30-
for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
32+
# Switch back to root for Harbor's runtime setup
33+
USER root
3134

3235
ENTRYPOINT []

benchmarks/ccb_mcp_compliance/ccx-compliance-115/environment/Dockerfile

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -11,8 +11,16 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
1111
python3 python3-pip \
1212
&& rm -rf /var/lib/apt/lists/*
1313

14+
# Create claude user BEFORE cloning so files are owned correctly from the
15+
# start. This avoids a post-clone chown -R layer that doubles image size
16+
# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
17+
RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
18+
RUN mkdir -p /workspace /logs/agent /logs/verifier && \
19+
chown -R claude:claude /workspace /logs
20+
21+
# Clone as claude — files land claude-owned, no separate chown layer needed.
22+
USER claude
1423
WORKDIR /workspace
15-
1624
# Clone local checkout repos (baseline config: agent has local access to these)
1725
RUN git clone --depth 1 https://github.com/sg-evals/django--674eda1c /workspace/django--674eda1c
1826

@@ -21,12 +29,7 @@ RUN git config --global user.email "agent@example.com" && \
2129
git config --global user.name "Agent" && \
2230
git config --global safe.directory '*'
2331

24-
# Create log directories
25-
RUN mkdir -p /logs/agent /logs/verifier
26-
27-
# Pre-create claude user and set ownership at build time so Harbor's
28-
# runtime chown is a no-op (avoids 15-30 min delay on large repos).
29-
RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
30-
for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
32+
# Switch back to root for Harbor's runtime setup
33+
USER root
3134

3235
ENTRYPOINT []

benchmarks/ccb_mcp_compliance/ccx-compliance-118/environment/Dockerfile

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -11,8 +11,16 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
1111
python3 python3-pip \
1212
&& rm -rf /var/lib/apt/lists/*
1313

14+
# Create claude user BEFORE cloning so files are owned correctly from the
15+
# start. This avoids a post-clone chown -R layer that doubles image size
16+
# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
17+
RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
18+
RUN mkdir -p /workspace /logs/agent /logs/verifier && \
19+
chown -R claude:claude /workspace /logs
20+
21+
# Clone as claude — files land claude-owned, no separate chown layer needed.
22+
USER claude
1423
WORKDIR /workspace
15-
1624
# Clone local checkout repos (baseline config: agent has local access to these)
1725
RUN git clone --depth 1 https://github.com/sg-evals/django--674eda1c /workspace/django--674eda1c
1826

@@ -21,12 +29,7 @@ RUN git config --global user.email "agent@example.com" && \
2129
git config --global user.name "Agent" && \
2230
git config --global safe.directory '*'
2331

24-
# Create log directories
25-
RUN mkdir -p /logs/agent /logs/verifier
26-
27-
# Pre-create claude user and set ownership at build time so Harbor's
28-
# runtime chown is a no-op (avoids 15-30 min delay on large repos).
29-
RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
30-
for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
32+
# Switch back to root for Harbor's runtime setup
33+
USER root
3134

3235
ENTRYPOINT []

benchmarks/ccb_mcp_compliance/ccx-compliance-124/environment/Dockerfile

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -11,8 +11,16 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
1111
g++ make \
1212
&& rm -rf /var/lib/apt/lists/*
1313

14+
# Create claude user BEFORE cloning so files are owned correctly from the
15+
# start. This avoids a post-clone chown -R layer that doubles image size
16+
# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
17+
RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
18+
RUN mkdir -p /workspace /logs/agent /logs/verifier && \
19+
chown -R claude:claude /workspace /logs
20+
21+
# Clone as claude — files land claude-owned, no separate chown layer needed.
22+
USER claude
1423
WORKDIR /workspace
15-
1624
# Clone local checkout repos (baseline config: agent has local access to these)
1725
RUN git clone --depth 1 https://github.com/sg-evals/firefox--871325b8 /workspace/firefox--871325b8
1826

@@ -21,12 +29,7 @@ RUN git config --global user.email "agent@example.com" && \
2129
git config --global user.name "Agent" && \
2230
git config --global safe.directory '*'
2331

24-
# Create log directories
25-
RUN mkdir -p /logs/agent /logs/verifier
26-
27-
# Pre-create claude user and set ownership at build time so Harbor's
28-
# runtime chown is a no-op (avoids 15-30 min delay on large repos).
29-
RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
30-
for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
32+
# Switch back to root for Harbor's runtime setup
33+
USER root
3134

3235
ENTRYPOINT []
