Commit a6351e5

sjarmak and claude committed
feat: unified dual-score benchmark (benchmarks/csb/) with 275 tasks in 9 merged suites
Merge 20 suites into 9 thematic suites under benchmarks/csb/.
Agent always produces both direct file edits AND answer.json.
Verifier writes reward_direct.txt + reward_artifact.txt independently.
Original csb_sdlc_* and csb_org_* directories untouched.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1cbeee5 commit a6351e5
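
The commit message says the verifier writes `reward_direct.txt` and `reward_artifact.txt` independently. The verifier itself is not shown in this excerpt, so the following is only a minimal sketch of that idea. The function name `write_rewards`, the output directory, the clamping to [0, 1], and the four-decimal format are all assumptions made for illustration.

```python
from pathlib import Path


def write_rewards(out_dir: Path, direct: float, artifact: float) -> None:
    """Write each score to its own file so neither result depends on the other.

    The file names come from the commit message; everything else here
    (directory, clamping, number format) is a hypothetical choice.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    scores = (("reward_direct.txt", direct), ("reward_artifact.txt", artifact))
    for name, score in scores:
        clamped = min(1.0, max(0.0, score))  # assumption: rewards live in [0, 1]
        (out_dir / name).write_text(f"{clamped:.4f}\n")
```

Keeping the two files separate means a failure in one scoring path (for example, a missing `answer.json`) could be reported as a zero artifact reward without touching the direct-edit reward.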

5,153 files changed, +601011 -39 lines changed


agents/claude_baseline_agent.py

Lines changed: 44 additions & 37 deletions
@@ -77,7 +77,24 @@
 - ✅ `cargo check -p crate 2>&1 | head -100` — `head` returns as soon as it has enough output
 - ✅ Run long builds in the background and check output later
 
-**If a command takes more than 60 seconds, it is probably too broad.** Kill it, narrow the scope, and retry."""
+**If a command takes more than 60 seconds, it is probably too broad.** Kill it, narrow the scope, and retry.
+
+## OUTPUT ARTIFACT
+
+After completing your work, also write `/workspace/answer.json` summarizing what you did:
+```json
+{
+  "analysis": {
+    "summary": "Brief description of your approach",
+    "files_examined": [{"path": "file.ext", "description": "why you looked at it"}],
+    "reasoning": "Detailed explanation or analysis"
+  },
+  "changes": [
+    {"file": "path.ext", "description": "what you changed", "diff": "unified diff"}
+  ]
+}
+```
+Include `changes` with unified diffs for every file you modified. For analysis-only tasks, omit `changes` and focus on `analysis`. This artifact is scored independently from your direct edits."""
 
 
 # Concise Sourcegraph MCP tool reference for CLAUDE.md injection.
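
The prompt above asks for an `answer.json` with an `analysis` object and an optional `changes` list of unified diffs. As a rough sketch of how such an artifact could be assembled, the snippet below uses the standard library's `difflib.unified_diff`. The helper names `make_change` and `build_answer` and the `a/` and `b/` path prefixes are hypothetical and not part of this commit.

```python
import difflib
import json


def make_change(path: str, before: str, after: str, description: str) -> dict:
    """Build one `changes` entry whose `diff` field is a unified diff."""
    diff = "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}",  # assumption: git-style a/ and b/ prefixes
            tofile=f"b/{path}",
        )
    )
    return {"file": path, "description": description, "diff": diff}


def build_answer(summary, files_examined, reasoning, changes=None) -> dict:
    """Assemble the artifact; `changes` is left out for analysis-only tasks."""
    answer = {
        "analysis": {
            "summary": summary,
            "files_examined": files_examined,
            "reasoning": reasoning,
        }
    }
    if changes:
        answer["changes"] = changes
    return answer


def dump_answer(answer: dict) -> str:
    """Serialize the artifact; the caller decides where to write it."""
    return json.dumps(answer, indent=2)
```

The string returned by `dump_answer` would then be written to `/workspace/answer.json`, the path named in the prompt.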
@@ -106,7 +123,7 @@
 # No truncation language — local source files simply aren't present.
 # 25% shorter than V4: removes Workflows, Output Formatting, Common Mistakes, Query Patterns.
 # {repo_scope} is replaced at runtime with the target repository filter.
-# {workflow_tail} is replaced with edit+test steps (direct) or produce-artifact step (artifact_full).
+# {workflow_tail} is replaced with edit+test+answer.json steps (always both direct and artifact).
 V5_PREAMBLE_TEMPLATE = """# IMPORTANT: Source Code Access
 
 **Local source files are not present.** Your workspace does not contain source code. You **MUST** use Sourcegraph MCP tools to discover, read, and understand code before making any changes.
@@ -643,35 +660,30 @@ def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
         )
         repo_scope += branch_instructions
 
-        # Workflow steps 3-4 vary by config: direct configs edit+test
-        # locally, artifact configs produce diffs as output artifacts.
-        if mcp_type == "artifact_full":
-            workflow_tail = (
-                "3. **Produce answer.json** — Write ALL output to "
-                "`/workspace/answer.json` with this structure:\n"
-                " ```json\n"
-                " {\n"
-                ' "analysis": {\n'
-                ' "summary": "Brief description of your approach",\n'
-                ' "files_examined": [{"path": "file.ext", "description": "..."}],\n'
-                ' "reasoning": "Detailed explanation or analysis"\n'
-                " },\n"
-                ' "changes": [\n'
-                ' {"file": "path.ext", "description": "...", "diff": "unified diff"}\n'
-                " ]\n"
-                " }\n"
-                " ```\n"
-                " Omit `changes` if the task is analysis-only. "
-                "Do NOT edit source files directly — produce diffs in "
-                "`changes[]` instead."
-            )
-        else:
-            workflow_tail = (
-                "3. **Edit locally** — Use Edit, Write, and Bash to "
-                "create or modify files in your working directory\n"
-                "4. **Verify locally** — Run tests with Bash to check "
-                "your changes"
-            )
+        # Workflow steps 3-5: agent always does BOTH direct edits AND
+        # produces answer.json. The verifier scores both independently.
+        workflow_tail = (
+            "3. **Edit locally** — Use Edit, Write, and Bash to "
+            "create or modify files in your working directory\n"
+            "4. **Verify locally** — Run tests with Bash to check "
+            "your changes\n"
+            "5. **Produce answer.json** — After completing your edits, "
+            "also write `/workspace/answer.json` summarizing your work:\n"
+            " ```json\n"
+            " {\n"
+            ' "analysis": {\n'
+            ' "summary": "Brief description of your approach",\n'
+            ' "files_examined": [{"path": "file.ext", "description": "..."}],\n'
+            ' "reasoning": "Detailed explanation or analysis"\n'
+            " },\n"
+            ' "changes": [\n'
+            ' {"file": "path.ext", "description": "...", "diff": "unified diff"}\n'
+            " ]\n"
+            " }\n"
+            " ```\n"
+            " Include `changes` with unified diffs for every file you modified. "
+            "For analysis-only tasks, omit `changes` and focus on `analysis`."
+        )
 
         mcp_preamble = V5_PREAMBLE_TEMPLATE.format(
             repo_scope=repo_scope, workflow_tail=workflow_tail
@@ -830,12 +842,7 @@ def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
         else:
             repo_filter_system = "Use list_repos to discover available repositories first."
 
-        if mcp_type == "artifact_full":
-            mcp_system_prompt = f"""IMPORTANT: Local source files are not present. You MUST use Sourcegraph MCP tools to discover and read code. Write ALL output to /workspace/answer.json with "analysis" (summary, files_examined, reasoning) and optional "changes" (file, description, diff) arrays. Do NOT edit source files directly.
-
-{repo_filter_system}"""
-        else:
-            mcp_system_prompt = f"""IMPORTANT: Local source files are not present. You MUST use Sourcegraph MCP tools to discover and read code, then create or edit local files based on what you learn. Run tests locally to verify your changes.
+        mcp_system_prompt = f"""IMPORTANT: Local source files are not present. You MUST use Sourcegraph MCP tools to discover and read code, then create or edit local files based on what you learn. Run tests locally to verify your changes. After completing edits, also write /workspace/answer.json with "analysis" (summary, files_examined, reasoning) and "changes" (file, description, diff) arrays.
 
 {repo_filter_system}"""
         system_prompt_append = EVALUATION_CONTEXT_PROMPT + "\n\n---\n\n" + mcp_system_prompt
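
The diff above passes a `workflow_tail` that contains literal JSON braces into `V5_PREAMBLE_TEMPLATE.format(...)`. A small sketch of why that is safe: `str.format` only interprets braces in the template itself and does not re-scan substituted values. `TEMPLATE` below is a shortened, hypothetical stand-in for the real preamble, and `render` is an illustrative helper that does not appear in the commit.

```python
# Hypothetical, shortened stand-in for V5_PREAMBLE_TEMPLATE.
TEMPLATE = "Scope: {repo_scope}\n\n## Workflow\n{workflow_tail}\n"

# The value may contain braces; only the template's own braces are parsed.
workflow_tail = (
    "3. **Edit locally**\n"
    "4. **Verify locally**\n"
    "5. **Produce answer.json**\n"
    '   {"analysis": {}, "changes": []}'
)


def render(repo_scope: str) -> str:
    """Fill both placeholders; JSON braces in the value pass through as-is."""
    return TEMPLATE.format(repo_scope=repo_scope, workflow_tail=workflow_tail)
```

Literal braces would only need doubling (`{{` and `}}`) if they were written directly inside the template string.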
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Base tools
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    curl \
+    python3 \
+    golang-go \
+    && rm -rf /var/lib/apt/lists/*
+
+# Create claude user BEFORE cloning so files are owned correctly from the
+# start. This avoids a post-clone chown -R layer that doubles image size
+# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
+RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
+RUN mkdir -p /workspace /logs/agent /logs/verifier && \
+    chown -R claude:claude /workspace /logs
+
+# Clone as claude — files land claude-owned, no separate chown layer needed.
+USER claude
+WORKDIR /workspace
+# Clone local checkout repos (baseline config: agent has local access to these)
+RUN git clone --depth 1 https://github.com/sg-evals/kubernetes--v1.32.0 /workspace/kubernetes--v1.32.0
+RUN git clone --depth 1 https://github.com/sg-evals/client-go--v0.32.0 /workspace/client-go--v0.32.0
+RUN git clone --depth 1 https://github.com/sg-evals/api--v0.32.0 /workspace/api--v0.32.0
+RUN git clone --depth 1 https://github.com/sg-evals/etcd-io-etcd /workspace/etcd-io-etcd
+
+# Initialize git identity for agent commits
+RUN git config --global user.email "agent@example.com" && \
+    git config --global user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Switch back to root for Harbor's runtime setup
+USER root
+
+ENTRYPOINT []
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# ccx-config-trace-010 — artifact_baseline variant
+# Baseline with local code + artifact mode (verifier parses answer.json).
+
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Base tools
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    curl \
+    python3 \
+    golang-go \
+    && rm -rf /var/lib/apt/lists/*
+
+# Create claude user BEFORE cloning so files are owned correctly from the
+# start. This avoids a post-clone chown -R layer that doubles image size
+# and takes 15-30 min on overlay2 (copy-on-write duplicates every inode).
+RUN adduser --disabled-password --gecos '' claude 2>/dev/null || true
+RUN mkdir -p /workspace /logs/agent /logs/verifier && \
+    chown -R claude:claude /workspace /logs
+
+# Clone as claude — files land claude-owned, no separate chown layer needed.
+USER claude
+WORKDIR /workspace
+# Clone local checkout repos (baseline config: agent has local access to these)
+RUN git clone --depth 1 https://github.com/sg-evals/kubernetes--v1.32.0 /workspace/kubernetes--v1.32.0
+RUN git clone --depth 1 https://github.com/sg-evals/client-go--v0.32.0 /workspace/client-go--v0.32.0
+RUN git clone --depth 1 https://github.com/sg-evals/api--v0.32.0 /workspace/api--v0.32.0
+RUN git clone --depth 1 https://github.com/sg-evals/etcd-io-etcd /workspace/etcd-io-etcd
+
+# Initialize git identity for agent commits
+RUN git config --global user.email "agent@example.com" && \
+    git config --global user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Switch back to root for Harbor's runtime setup
+USER root
+
+# Mark artifact-only mode — verifier parses answer.json
+RUN touch /tmp/.artifact_only_mode
+
+ENTRYPOINT []
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+# ccx-config-trace-010 — artifact_only variant
+# No local repo clone — agent uses Sourcegraph MCP exclusively for code access.
+# Agent produces answer.json artifact; verifier scores the artifact.
+
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV SOURCEGRAPH_REPOS="sg-evals/api--v0.32.0,sg-evals/client-go--v0.32.0,sg-evals/etcd-io-etcd,sg-evals/kubernetes--v1.32.0"
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    python3 \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Empty workspace — agent discovers code via MCP tools only
+RUN git init && \
+    git config user.email "agent@example.com" && \
+    git config user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Create log directories
+RUN mkdir -p /logs/agent /logs/verifier
+
+# Mark artifact-only mode — verifiers and eval scripts check this flag
+RUN touch /tmp/.artifact_only_mode
+
+# Pre-create claude user and set ownership at build time.
+RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
+    for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
+
+ENTRYPOINT []
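
The Dockerfiles in this commit signal their mode with empty marker files: `/tmp/.artifact_only_mode` here, and `/tmp/.sg_only_mode` in the sg_only variant. How the verifiers and eval scripts actually read these flags is not shown, so the following is only a sketch of one way to do it. The function name `detect_mode`, the `"direct"` fallback label, and the precedence between the two markers are assumptions.

```python
from pathlib import Path


def detect_mode(root: Path = Path("/")) -> str:
    """Return a mode label based on which marker file exists under <root>/tmp.

    `root` is parameterized only so the check can be exercised outside a
    container; inside the image it would be "/".
    """
    tmp = root / "tmp"
    if (tmp / ".sg_only_mode").exists():  # assumption: sg_only takes precedence
        return "sg_only"
    if (tmp / ".artifact_only_mode").exists():
        return "artifact_only"
    return "direct"  # hypothetical label for "no marker present"
```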
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# ccx-config-trace-010 — sg_only_env variant (v2: clone-at-verify)
+# Empty workspace — agent uses Sourcegraph MCP for code access.
+# Verifier clones mirror(s) at verification time via clone manifest.
+
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV SOURCEGRAPH_REPOS="sg-evals/api--v0.32.0,sg-evals/client-go--v0.32.0,sg-evals/etcd-io-etcd,sg-evals/kubernetes--v1.32.0"
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    python3 \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Empty git repo so agent can commit work
+RUN git init && \
+    git config user.email "agent@example.com" && \
+    git config user.name "Agent"
+
+RUN mkdir -p /logs/agent /logs/verifier
+
+# Clone manifest for verifier (clone-at-verify strategy)
+RUN echo '{"workdir":"/workspace","repos":[{"mirror":"sg-evals/kubernetes--v1.32.0","target_dir":"kubernetes--v1.32.0"},{"mirror":"sg-evals/client-go--v0.32.0","target_dir":"client-go--v0.32.0"},{"mirror":"sg-evals/api--v0.32.0","target_dir":"api--v0.32.0"},{"mirror":"sg-evals/etcd-io-etcd","target_dir":"etcd-io-etcd"}]}' > /tmp/.sg_only_clone_manifest.json
+
+# Mark sg_only mode
+RUN touch /tmp/.sg_only_mode
+
+# Pre-create claude user and set ownership at build time.
+RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
+    for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
+
+ENTRYPOINT []
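
The sg_only variant writes a clone manifest to `/tmp/.sg_only_clone_manifest.json` for the verifier's clone-at-verify step. The verifier code is not part of this excerpt; as a rough sketch, the manifest could be turned into `git clone` argument lists as below. The function name `clone_commands`, the `https://github.com` base URL, and the `--depth 1` flag are assumptions (the last two mirror the baseline Dockerfile's clone lines).

```python
import json


def clone_commands(manifest_text: str, base_url: str = "https://github.com") -> list[list[str]]:
    """Translate a clone manifest into one `git clone` argv per repo.

    Expects the shape used in the Dockerfile above:
    {"workdir": "...", "repos": [{"mirror": "...", "target_dir": "..."}]}
    """
    manifest = json.loads(manifest_text)
    workdir = manifest["workdir"].rstrip("/")
    return [
        [
            "git", "clone", "--depth", "1",
            f"{base_url}/{repo['mirror']}",
            f"{workdir}/{repo['target_dir']}",
        ]
        for repo in manifest["repos"]
    ]
```

Each returned list could be handed to `subprocess.run(cmd, check=True)`; returning argument lists instead of shell strings avoids quoting problems.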
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+# Stack Trace Symbol Resolution: rest.Config
+
+## Your Task
+
+Find the repository and file path where the `Config` struct is defined (not vendored) in the `rest` package of `k8s.io/client-go`. What is the exact Go package import path?
+
+## Context
+
+You are working on a codebase task involving repos from the crossrepo tracing domain.
+
+## Available Resources
+
+The local `/workspace/` directory contains: sg-evals/kubernetes--v1.32.0, sg-evals/client-go--v0.32.0, sg-evals/api--v0.32.0, sg-evals/etcd-io-etcd.
+
+
+## Output Format
+
+Use the published task contract:
+
+- `TASK_WORKDIR=/workspace`
+- `TASK_REPO_ROOT=/workspace`
+- `TASK_OUTPUT=/workspace/answer.json`
+
+Create a file at `TASK_OUTPUT` (`/workspace/answer.json`) with your findings in the following structure:
+
+```json
+{
+  "files": [
+    {"repo": "repo-name", "path": "relative/path/to/file.go"}
+  ],
+  "symbols": [
+    {"repo": "repo-name", "path": "relative/path/to/file.go", "symbol": "SymbolName"}
+  ],
+  "chain": [
+    {"repo": "repo-name", "path": "relative/path/to/file.go", "symbol": "FunctionName"}
+  ],
+  "text": "Narrative explanation of your findings, citing repos and file paths."
+}
+```
+
+Include only the fields relevant to this task. Your answer is evaluated against a closed-world oracle — completeness matters.
+
+## Evaluation
+
+Your answer will be scored on:
+- **File recall and precision**: Did you find all relevant files?
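
The task text names file recall and precision as the scoring criteria but does not show the scorer. A minimal sketch of how those two numbers could be computed over the `files` entries of `answer.json` follows. Treating a file as the pair `(repo, path)`, the function name `file_precision_recall`, and scoring an empty set as 0.0 are assumptions.

```python
def file_precision_recall(predicted: list[dict], expected: list[dict]) -> tuple[float, float]:
    """Compare predicted and oracle file lists as sets of (repo, path) pairs."""
    pred = {(f["repo"], f["path"]) for f in predicted}
    gold = {(f["repo"], f["path"]) for f in expected}
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0  # share of reported files that are correct
    recall = hits / len(gold) if gold else 0.0     # share of oracle files that were found
    return precision, recall
```

Under this definition, listing extra files lowers precision while missing oracle files lowers recall, which matches the note that completeness matters against a closed-world oracle.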
