Commit 11765ac

chore: sync benchmark instructions, preflight checks, and local ignore rules
1 parent 7931be6 commit 11765ac

15 files changed, with 232 additions and 124 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -43,3 +43,7 @@ vendor/dependeval_repos/
 *.pem
 .claude/*
 !.claude/commands/
+
+# Local benchmark agent/session state
+benchmarks/.claude/
+benchmarks/locobench-agent/.claude/

AGENTS.md

Lines changed: 2 additions & 0 deletions
@@ -79,4 +79,6 @@ python3 scripts/generate_eval_report.py
 - `configs/_common.sh` - shared run infra (parallelism, token refresh, validation hooks)
 - `configs/*_2config.sh` - per-suite run launchers
 - `configs/validate_one_per_benchmark.sh --smoke-runtime` - quick no-agent runtime smoke (1 task per benchmark)
+- Smoke interpretation: `smoke_verifier_nonzero_with_reward` is acceptable in no-agent mode; use `--smoke-timeout-overrides "ccb_pytorch=900,ccb_tac=900,ccb_crossrepo=900"` for timeout-heavy suites.
+- Timeout diagnostics: `smoke_build_timeout` (image build phase) vs `smoke_verify_timeout` (verifier phase).
 - `scripts/promote_run.py` - staging to official promotion flow
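
Note on the `--smoke-timeout-overrides` value added above: it is a comma-separated list of `suite=value` pairs. The sketch below shows one way such a string could be parsed. It is not the repository's implementation; the function name `parse_timeout_overrides` is hypothetical, and treating the values as positive whole seconds is an assumption, since the diff does not state the unit.

```python
from typing import Dict


def parse_timeout_overrides(spec: str) -> Dict[str, int]:
    """Parse 'suite=value,suite=value' into a {suite: value} mapping.

    Hypothetical helper: the values are assumed to be positive integers
    (probably seconds), which the diff itself does not confirm.
    """
    overrides: Dict[str, int] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        suite, sep, raw_value = item.partition("=")
        suite = suite.strip()
        if not sep or not suite:
            raise ValueError(f"expected suite=value, got {item!r}")
        value = int(raw_value)  # raises ValueError on non-numeric input
        if value <= 0:
            raise ValueError(f"timeout for {suite!r} must be positive")
        overrides[suite] = value
    return overrides


if __name__ == "__main__":
    print(parse_timeout_overrides("ccb_pytorch=900,ccb_tac=900,ccb_crossrepo=900"))
```

Rejecting malformed pairs up front keeps a typo in the flag from silently falling back to a default timeout.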

CLAUDE.md

Lines changed: 2 additions & 0 deletions
@@ -79,4 +79,6 @@ python3 scripts/generate_eval_report.py
 - `configs/_common.sh` - shared run infra (parallelism, token refresh, validation hooks)
 - `configs/*_2config.sh` - per-suite run launchers
 - `configs/validate_one_per_benchmark.sh --smoke-runtime` - quick no-agent runtime smoke (1 task per benchmark)
+- Smoke interpretation: `smoke_verifier_nonzero_with_reward` is acceptable in no-agent mode; use `--smoke-timeout-overrides "ccb_pytorch=900,ccb_tac=900,ccb_crossrepo=900"` for timeout-heavy suites.
+- Timeout diagnostics: `smoke_build_timeout` (image build phase) vs `smoke_verify_timeout` (verifier phase).
 - `scripts/promote_run.py` - staging to official promotion flow
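
Note on the timeout diagnostics added above: the two labels separate a timeout in the image build phase from one in the verifier phase. A minimal sketch of that distinction follows. Only the label strings come from the diff; the function `classify_smoke_timeout`, its parameters, and the idea of comparing elapsed time to a per-phase limit are assumptions for illustration, not the logic of `configs/validate_one_per_benchmark.sh`.

```python
from typing import Optional


def classify_smoke_timeout(
    build_elapsed_s: float,
    verify_elapsed_s: float,
    build_limit_s: float,
    verify_limit_s: float,
) -> Optional[str]:
    """Return the diagnostic label for the first phase that ran over its limit.

    Hypothetical model: each phase has its own time limit, and the build
    phase is checked first because the verifier cannot start before it ends.
    """
    if build_elapsed_s > build_limit_s:
        return "smoke_build_timeout"  # image build phase exceeded its limit
    if verify_elapsed_s > verify_limit_s:
        return "smoke_verify_timeout"  # verifier phase exceeded its limit
    return None  # neither phase timed out


if __name__ == "__main__":
    print(classify_smoke_timeout(950.0, 0.0, 900.0, 900.0))
    print(classify_smoke_timeout(120.0, 905.0, 900.0, 900.0))
```

Keeping the two labels distinct tells the operator whether to look at image build time or at the verifier itself before raising a timeout override.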

benchmarks/ccb_docgen/docgen-k8s-apiserver-001/instruction.md

Lines changed: 12 additions & 12 deletions
@@ -9,18 +9,18 @@ Produce a subsystem architecture and extension guide for the apiserver library w
 
 ## Scope
 
-Primary focus area: `staging/src/k8s.io/apiserver`.
-
-Your document must explain component responsibilities, end-to-end flow, and extension/operational tradeoffs.
-
-## Required Sections
-
-1. **Subsystem Overview** purpose, boundaries, and upstream/downstream dependencies
-2. **Key Components** — major types/modules and their responsibilities
-3. **End-to-End Flow** — request/control flow with concrete file-backed references
-4. **Failure Modes & Tradeoffs** — common failures, limits, and design tradeoffs
-5. **Extension Points** — where behavior can be customized and associated risks
-6. **Source File Map** — list the most relevant files/directories used in your analysis
+Focus on the Kubernetes apiserver library subsystem.
+Your document must explain component responsibilities, end-to-end flow, and extension or operational tradeoffs.
+
+## Content Expectations
+
+Address all of the following in your own structure:
+- subsystem purpose, boundaries, and upstream/downstream dependencies
+- key components and how responsibilities are split
+- end-to-end control/request flow with concrete repository evidence
+- failure modes, limits, and design tradeoffs
+- extension points, customization hooks, and associated risks
+- a concise map of the most important files/directories used in your analysis
 
 ## Quality Bar
 

benchmarks/ccb_docgen/docgen-k8s-applyconfig-001/instruction.md

Lines changed: 12 additions & 12 deletions
@@ -9,18 +9,18 @@ Produce a deep guide for applyconfigurations and Server-Side Apply semantics, in
 
 ## Scope
 
-Primary focus area: `staging/src/k8s.io/client-go/applyconfigurations`.
-
-Your document must explain component responsibilities, end-to-end flow, and extension/operational tradeoffs.
-
-## Required Sections
-
-1. **Subsystem Overview** purpose, boundaries, and upstream/downstream dependencies
-2. **Key Components** — major types/modules and their responsibilities
-3. **End-to-End Flow** — request/control flow with concrete file-backed references
-4. **Failure Modes & Tradeoffs** — common failures, limits, and design tradeoffs
-5. **Extension Points** — where behavior can be customized and associated risks
-6. **Source File Map** — list the most relevant files/directories used in your analysis
+Focus on the applyconfigurations and Server-Side Apply subsystem.
+Your document must explain component responsibilities, end-to-end flow, and extension or operational tradeoffs.
+
+## Content Expectations
+
+Address all of the following in your own structure:
+- subsystem purpose, boundaries, and upstream/downstream dependencies
+- key components and how responsibilities are split
+- end-to-end control/request flow with concrete repository evidence
+- failure modes, limits, and design tradeoffs
+- extension points, customization hooks, and associated risks
+- a concise map of the most important files/directories used in your analysis
 
 ## Quality Bar
 

benchmarks/ccb_docgen/docgen-k8s-clientgo-001/instruction.md

Lines changed: 12 additions & 12 deletions
@@ -9,18 +9,18 @@ Produce an advanced systems guide for client-go that explains API access, contro
 
 ## Scope
 
-Primary focus area: `staging/src/k8s.io/client-go`.
-
-Your document must explain component responsibilities, end-to-end flow, and extension/operational tradeoffs.
-
-## Required Sections
-
-1. **Subsystem Overview** purpose, boundaries, and upstream/downstream dependencies
-2. **Key Components** — major types/modules and their responsibilities
-3. **End-to-End Flow** — request/control flow with concrete file-backed references
-4. **Failure Modes & Tradeoffs** — common failures, limits, and design tradeoffs
-5. **Extension Points** — where behavior can be customized and associated risks
-6. **Source File Map** — list the most relevant files/directories used in your analysis
+Focus on the client-go subsystem.
+Your document must explain component responsibilities, end-to-end flow, and extension or operational tradeoffs.
+
+## Content Expectations
+
+Address all of the following in your own structure:
+- subsystem purpose, boundaries, and upstream/downstream dependencies
+- key components and how responsibilities are split
+- end-to-end control/request flow with concrete repository evidence
+- failure modes, limits, and design tradeoffs
+- extension points, customization hooks, and associated risks
+- a concise map of the most important files/directories used in your analysis
 
 ## Quality Bar
 

benchmarks/ccb_docgen/docgen-k8s-cm-001/instruction.md

Lines changed: 12 additions & 12 deletions
@@ -9,18 +9,18 @@ Produce a subsystem architecture guide for kubelet container manager, including
 
 ## Scope
 
-Primary focus area: `pkg/kubelet/cm`.
-
-Your document must explain component responsibilities, end-to-end flow, and extension/operational tradeoffs.
-
-## Required Sections
-
-1. **Subsystem Overview** purpose, boundaries, and upstream/downstream dependencies
-2. **Key Components** — major types/modules and their responsibilities
-3. **End-to-End Flow** — request/control flow with concrete file-backed references
-4. **Failure Modes & Tradeoffs** — common failures, limits, and design tradeoffs
-5. **Extension Points** — where behavior can be customized and associated risks
-6. **Source File Map** — list the most relevant files/directories used in your analysis
+Focus on the kubelet container manager subsystem.
+Your document must explain component responsibilities, end-to-end flow, and extension or operational tradeoffs.
+
+## Content Expectations
+
+Address all of the following in your own structure:
+- subsystem purpose, boundaries, and upstream/downstream dependencies
+- key components and how responsibilities are split
+- end-to-end control/request flow with concrete repository evidence
+- failure modes, limits, and design tradeoffs
+- extension points, customization hooks, and associated risks
+- a concise map of the most important files/directories used in your analysis
 
 ## Quality Bar
 

benchmarks/ccb_docgen/docgen-k8s-fairqueuing-001/instruction.md

Lines changed: 12 additions & 12 deletions
@@ -9,18 +9,18 @@ Produce an algorithmic deep-dive on APF QueueSet behavior, dispatch flow, and fa
 
 ## Scope
 
-Primary focus area: `staging/src/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset`.
-
-Your document must explain component responsibilities, end-to-end flow, and extension/operational tradeoffs.
-
-## Required Sections
-
-1. **Subsystem Overview** purpose, boundaries, and upstream/downstream dependencies
-2. **Key Components** — major types/modules and their responsibilities
-3. **End-to-End Flow** — request/control flow with concrete file-backed references
-4. **Failure Modes & Tradeoffs** — common failures, limits, and design tradeoffs
-5. **Extension Points** — where behavior can be customized and associated risks
-6. **Source File Map** — list the most relevant files/directories used in your analysis
+Focus on the APF QueueSet subsystem.
+Your document must explain component responsibilities, end-to-end flow, and extension or operational tradeoffs.
+
+## Content Expectations
+
+Address all of the following in your own structure:
+- subsystem purpose, boundaries, and upstream/downstream dependencies
+- key components and how responsibilities are split
+- end-to-end control/request flow with concrete repository evidence
+- failure modes, limits, and design tradeoffs
+- extension points, customization hooks, and associated risks
+- a concise map of the most important files/directories used in your analysis
 
 ## Quality Bar
 

benchmarks/ccb_security/sec-reach-002/instruction.md

Lines changed: 7 additions & 33 deletions
@@ -22,12 +22,6 @@ Determine whether this function is **exploitable from external HTTP input** (i.e
 - **Repository**: envoyproxy/envoy @ v1.31.1
 - **Language**: C++
 - **Focus Area**: Admin interface architecture and access control
-- **Relevant Files**:
-- `source/server/admin/admin.cc` (admin endpoint handlers)
-- `source/server/admin/admin.h` (admin interface definition)
-- `source/server/server.cc` (server initialization, admin interface binding)
-- `envoy/server/admin.h` (admin interface API)
-- Configuration files showing admin interface setup
 
 ## Background: Admin Interface vs Data Plane
 
@@ -45,33 +39,13 @@ Envoy has two separate network interfaces:
 
 ## Expected Deliverable
 
-Write your analysis to `/logs/agent/triage.md` with the following sections:
-
-### 1. Summary
-- Function under analysis
-- Overall exploitability assessment (EXPLOITABLE / MITIGATED / UNREACHABLE)
-- Key finding (1-2 sentences)
-
-### 2. Call Path Analysis
-- Complete call chain from HTTP request reception to `parseAuthority()`
-- Entry points (HTTP/1 vs HTTP/2)
-- Any conditional branching in the call path
-
-### 3. Protection Mechanisms
-- Network interface binding configuration (which IP/port?)
-- Access control mechanisms (authentication, IP allowlisting, etc.)
-- Separation between admin interface and data plane listeners
-- How Envoy routes requests to admin handlers vs regular HTTP handlers
-
-### 4. Exploitability Assessment
-- Can an external attacker (from the internet) trigger `/drain_listeners`?
-- What network configuration would be required for an attack to succeed?
-- Is the vulnerability **EXPLOITABLE** (reachable from external traffic) or **UNREACHABLE** (admin-only)?
-
-### 5. Severity Assessment
-- If exploitable: impact (DoS, memory corruption, information disclosure, RCE)
-- If mitigated: description of the protection mechanism
-- Confidence level in your assessment
+Write your analysis to `/logs/agent/triage.md`.
+
+Include enough evidence to support your exploitability conclusion, including:
+- the full call chain from request reception to the target handler
+- the boundary between data-plane and admin-plane request handling
+- the controls that prevent or permit external reachability
+- the network/configuration conditions under which the endpoint would become externally reachable
 
 ## Important Notes
 

benchmarks/ccb_swebenchpro/tasks/instance_nodebb-nodebb-76c6e30282906ac664f2c9278fc90999b27b1f48-vd59a5728dfc977f44533186ace531248c2917516/tests/test.sh

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.
