Commit 9f82070

sjarmak and claude committed
feat: US-010 - Task generation templates
Creates templates/mcp_unique_task/ with 7 templates for generating Harbor-compatible MCP-unique benchmark task directories:

- task.toml.j2: task metadata with mcp_suite, use_case_id, mcp_unique=true
- instruction.md.j2: customer-framed prompt with answer.json output format
- eval.sh.j2: exit-code-first evaluator calling oracle_checks.py
- task_spec.json.j2: full TaskSpec (prd/artifacts/evaluation/logging)
- Dockerfile.j2: baseline (clones local_checkout_repos only)
- Dockerfile.sg_only.j2: MCP-full (no clones, marks /tmp/.sg_only_mode)
- README.md: documents all template variables

All templates use Python string.Template syntax. Validated: all 6 content templates substitute cleanly; task_spec.json produces valid JSON; eval.sh passes bash -n syntax check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent cca10d6 commit 9f82070

9 files changed: +359 -1 lines changed

ralph-mcp-unique/prd.json

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@
 "All templates produce valid output when filled with example data"
 ],
 "priority": 10,
-"passes": false,
+"passes": true,
 "notes": "Suite directory is benchmarks/<mcp_suite>/<task_slug>/. Template eval.sh must handle sg_only mode guard and reward.txt path correctly."
 },
 {

ralph-mcp-unique/progress.txt

Lines changed: 21 additions & 0 deletions
@@ -160,6 +160,27 @@
 [2026-02-20 20:21:26 UTC] Iteration 1 complete
 [2026-02-20 20:21:28 UTC] Iteration 2 started
 
+## 2026-02-20 - US-010: Task generation templates
+- Created `templates/mcp_unique_task/` with 7 files:
+  - `task.toml.j2`: task metadata with mcp_suite, use_case_id, repo_set_id, mcp_unique=true
+  - `instruction.md.j2`: customer-framed prompt with explicit answer.json output format
+  - `eval.sh.j2`: exit-code-first evaluator: calls oracle_checks.py, writes /logs/verifier/reward.txt, handles sg_only mode guard
+  - `task_spec.json.j2`: full TaskSpec with prd/artifacts/evaluation/logging sections
+  - `Dockerfile.j2`: baseline variant (clones local_checkout_repos)
+  - `Dockerfile.sg_only.j2`: MCP-full variant (no clones, marks /tmp/.sg_only_mode)
+  - `README.md`: documents all template variables with type/example/description table
+- All templates use Python string.Template syntax ($variable / ${variable}); bash $$ for literal dollar signs
+- Validation: all 6 templates substituted successfully with sample variables
+- task_spec.json.j2 validated as JSON after substitution; eval.sh.j2 passed bash -n
+- Files changed: `templates/mcp_unique_task/` (7 new files)
+- **Learnings for future iterations:**
+  - string.Template uses $variable syntax; bash code uses $$ to escape literal $
+  - Bash heredocs with $VAR in templates need $$VAR to avoid Template substitution
+  - task_spec.json.j2: "category" variable is separate from "task_family" — easy to miss
+  - Generated eval.sh uses $$(python3 ...) pattern for bash command substitution with $$-escaping
+  - Test sample_vars must include ALL variables used across ALL templates being tested
+---
+
 ## 2026-02-20 - US-009: Repo-set fixtures for starter pack
 - Created 5 repo-set fixtures in `fixtures/repo_sets/`:
   1. `kubernetes-ecosystem.json`: kubernetes/kubernetes (local) + sg-benchmarks/kubernetes-client-go + sg-benchmarks/kubernetes-api + etcd-io/etcd (all mcp_only). cross_org=true, 4 repos, Go.
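
The `$$` escaping rule recorded in the progress.txt learnings above can be checked in isolation. The following is a minimal sketch rather than code from this commit: the three-line template string is a hypothetical excerpt written in the style of eval.sh.j2, and `render_excerpt` is an illustrative helper name.

```python
from string import Template

# Hypothetical excerpt in the style of eval.sh.j2 (not copied from the commit).
EXCERPT = Template(
    'TASK_ID="$task_id"\n'
    'echo "Answer: $$ANSWER_PATH"\n'
    'SCORE=$$(python3 "$$ORACLE_CHECKS" | tail -1)\n'
)


def render_excerpt(task_id: str) -> str:
    """Fill $task_id; each $$ collapses to a single literal $ for bash."""
    return EXCERPT.substitute(task_id=task_id)


rendered = render_excerpt("CCX-dep-trace-001")
print(rendered)

# Without the $$ escape, substitute() treats $ANSWER_PATH as a template
# variable and raises KeyError because no value was supplied for it.
try:
    Template('echo "$ANSWER_PATH"').substitute(task_id="CCX-dep-trace-001")
    unescaped_failed = False
except KeyError:
    unescaped_failed = True
```

After rendering, the bash variables and the command substitution carry a single `$`, which is what the generated eval.sh needs at run time.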

templates/mcp_unique_task/Dockerfile.j2

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Base tools
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    curl \
+    python3 \
+    $language_packages \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Clone local checkout repos (baseline config: agent has local access to these)
+$clone_commands
+
+# Initialize git identity for agent commits
+RUN git config --global user.email "agent@example.com" && \
+    git config --global user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Create log directories
+RUN mkdir -p /logs/agent /logs/verifier
+
+ENTRYPOINT []
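
The `$clone_commands` placeholder in the baseline Dockerfile above expects ready-made Dockerfile instructions. The commit does not show how the generator builds that string, so the following is only a sketch under stated assumptions: repos are given as `org/name`, they are hosted on github.com, a shallow clone is acceptable, and each repo lands in `/workspace/<name>`. The helper name `build_clone_commands` is hypothetical.

```python
def build_clone_commands(repos: list[str], workdir: str = "/workspace") -> str:
    """Return one Dockerfile RUN line per repo, joined with newlines.

    Assumptions (not confirmed by the commit): GitHub hosting, shallow
    clones, and a destination directory named after the repo.
    """
    lines = []
    for repo in repos:
        org, _, name = repo.partition("/")
        if not org or not name:
            raise ValueError(f"expected org/name, got {repo!r}")
        lines.append(
            f"RUN git clone --depth 1 https://github.com/{repo}.git {workdir}/{name}"
        )
    return "\n".join(lines)


clone_commands = build_clone_commands(["kubernetes/kubernetes"])
print(clone_commands)
```

The returned string can be passed as the `clone_commands` value when the template is substituted; an empty repo list yields an empty string, which leaves only the comment line in the rendered Dockerfile.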

templates/mcp_unique_task/Dockerfile.sg_only.j2

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+# $task_id — sg_only variant
+# No local repo clone — agent uses Sourcegraph MCP exclusively for code access.
+# The verifier restores the full repo from /repo_full/ before scoring.
+
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    python3 \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Empty workspace — agent discovers code via MCP tools only
+RUN git init && \
+    git config user.email "agent@example.com" && \
+    git config user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Create log directories
+RUN mkdir -p /logs/agent /logs/verifier
+
+# Mark sg_only mode — verifiers and eval scripts check this flag
+RUN touch /tmp/.sg_only_mode
+
+ENTRYPOINT []

templates/mcp_unique_task/README.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+# MCP-Unique Task Templates
+
+Templates for generating Harbor-compatible MCP-unique benchmark task directories.
+All templates use Python `string.Template` syntax: `$variable` or `${variable}`.
+Literal `$` signs in bash code are escaped as `$$`.
+
+## Template Files
+
+| File | Purpose |
+|------|---------|
+| `task.toml.j2` | Harbor task metadata and verification config |
+| `instruction.md.j2` | Agent instruction with customer-framed prompt |
+| `eval.sh.j2` | Exit-code-first evaluator calling oracle_checks.py |
+| `task_spec.json.j2` | Full TaskSpec (oracle + evaluation checks) |
+| `Dockerfile.j2` | Baseline: clones local_checkout_repos |
+| `Dockerfile.sg_only.j2` | MCP-Full: no clone, marks /tmp/.sg_only_mode |
+
+## Template Variables
+
+### task.toml.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_id` | string | `CCX-dep-trace-001` | Task identifier (CCX-<family>-<NNN>) |
+| `$task_description` | string | `Trace blast radius of client-go changes` | Short description |
+| `$primary_repo` | string | `kubernetes/kubernetes` | Main local repo (org/name) |
+| `$task_family` | string | `cross-repo-dep-trace` | Task family ID from registry |
+| `$language` | string | `go` | Primary programming language |
+| `$difficulty` | string | `medium` | Task difficulty: easy/medium/hard |
+| `$time_limit_sec` | int | `900` | Agent time limit in seconds |
+| `$mcp_suite` | string | `ccb_mcp_crossrepo_tracing` | CCB MCP suite name |
+| `$use_case_id` | int | `1` | Use case ID from registry (1-100) |
+| `$repo_set_id` | string | `kubernetes-ecosystem` | Fixture ID |
+
+### instruction.md.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_title` | string | `Kubernetes Dependency Blast Radius` | Human-readable title |
+| `$customer_prompt` | string | `Find all repos that import...` | Customer-framed task prompt |
+| `$context_description` | string | `You are a platform engineer...` | Background/role context |
+| `$local_repo_description` | string | `The local /workspace contains kubernetes/kubernetes` | What's available locally |
+| `$mcp_repos_description` | string | `- sg-benchmarks/kubernetes-client-go...` | MCP-only repos bullet list |
+| `$evaluation_criteria` | string | `- Recall of affected repos...` | What scoring checks |
+
+### eval.sh.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_id` | string | `CCX-dep-trace-001` | Task ID for logging |
+
+### task_spec.json.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_id` | string | `CCX-dep-trace-001` | Task identifier |
+| `$task_family` | string | `cross-repo-dep-trace` | Task family |
+| `$use_case_id` | int | `1` | Use case ID |
+| `$category` | string | `A` | Category A-J |
+| `$mcp_suite` | string | `ccb_mcp_crossrepo_tracing` | Suite name |
+| `$user_story` | string | `As a platform engineer...` | PRD user story |
+| `$constraints_json` | JSON array | `["Must cite file paths"]` | Constraint list as JSON string |
+| `$success_definition` | string | `Agent identifies all affected repos` | Success criteria |
+| `$seed_prompt` | string | `Find all repos importing...` | Curation seed prompt |
+| `$repo_set_id` | string | `kubernetes-ecosystem` | Fixture ID |
+| `$required_files_json` | JSON array | `[]` | Oracle file list (empty until curated) |
+| `$required_symbols_json` | JSON array | `[]` | Oracle symbol list (empty until curated) |
+| `$dependency_chains_json` | JSON array | `[]` | Oracle chains (empty until curated) |
+| `$evaluation_modes_json` | JSON array | `["deterministic"]` | Evaluation modes |
+| `$evaluation_checks_json` | JSON array | `[{"type":"file_set_match",...}]` | Check configurations |
+
+### Dockerfile.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$language_packages` | string | `golang-go` | apt packages for the language |
+| `$clone_commands` | string | `RUN git clone...` | Shell commands to clone local repos |
+
+### Dockerfile.sg_only.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_id` | string | `CCX-dep-trace-001` | Task ID for Dockerfile comment |
+
+## Generation
+
+Templates are filled by `scripts/generate_mcp_unique_tasks.py`:
+
+```bash
+python3 scripts/generate_mcp_unique_tasks.py --use-case-ids 1 --out benchmarks/
+python3 scripts/generate_mcp_unique_tasks.py --category A --dry-run
+python3 scripts/generate_mcp_unique_tasks.py --all --curate-oracle
+```
+
+## Layout
+
+Generated task directories follow:
+
+```
+benchmarks/<mcp_suite>/<task_slug>/
+├── environment/
+│   ├── Dockerfile (baseline: clones local repos)
+│   └── Dockerfile.sg_only
+├── tests/
+│   ├── eval.sh
+│   ├── oracle_checks.py (copied from scripts/ccb_metrics/oracle_checks.py)
+│   ├── task_spec.json
+│   └── oracle_answer.json (populated by curate_oracle.py)
+├── task.toml
+└── instruction.md
+```
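
Because each template consumes a different subset of the variables tabulated in the README above, a render can fail late on a single missing key. A minimal sketch of a pre-flight check follows. It is not part of this commit: the helper names `template_variables` and `missing_variables` are hypothetical, and the two inline template strings are shortened stand-ins for the real files.

```python
from string import Template


def template_variables(text: str) -> set[str]:
    """Collect $name and ${name} placeholders, skipping $$ escapes."""
    names = set()
    for match in Template.pattern.finditer(text):
        name = match.group("named") or match.group("braced")
        if name:
            names.add(name)
    return names


def missing_variables(
    templates: dict[str, str], sample_vars: dict[str, object]
) -> dict[str, set[str]]:
    """Map each template name to the placeholders sample_vars lacks."""
    report = {}
    for filename, text in templates.items():
        missing = template_variables(text) - sample_vars.keys()
        if missing:
            report[filename] = missing
    return report


# Shortened stand-ins for the real template files.
templates = {
    "task.toml.j2": 'id = "$task_id"\nuse_case_id = $use_case_id\n',
    "eval.sh.j2": 'echo "=== ${task_id} evaluator ==="\necho "$$SCORE"\n',
}
report = missing_variables(templates, {"task_id": "CCX-dep-trace-001"})
print(report)
```

Running the check over every template before substitution reports all gaps at once, instead of stopping at the first `KeyError` raised by `substitute()`.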

templates/mcp_unique_task/eval.sh.j2

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+#!/bin/bash
+# eval.sh — MCP-unique benchmark evaluator for $task_id
+# Exit-code-first (SWE-Factory pattern):
+#   exit 0 — agent produced useful output (composite score > 0)
+#   exit 1 — total failure (composite score == 0 or missing answer)
+#
+# Writes /logs/verifier/reward.txt with the composite score [0.0, 1.0]
+
+set -euo pipefail
+
+TASK_ID="$task_id"
+ANSWER_PATH="/workspace/answer.json"
+TASK_SPEC_PATH="/tests/task_spec.json"
+ORACLE_CHECKS="/tests/oracle_checks.py"
+REWARD_PATH="/logs/verifier/reward.txt"
+
+mkdir -p /logs/verifier
+
+echo "=== $task_id evaluator ==="
+echo "Task spec: $$TASK_SPEC_PATH"
+echo "Answer: $$ANSWER_PATH"
+echo ""
+
+# sg_only mode guard: restore full repo if verifier wrapper exists
+if [ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ]; then
+    echo "sg_only mode: sourcing verifier wrapper..."
+    source /tests/sgonly_verifier_wrapper.sh
+fi
+
+# Verify answer file exists
+if [ ! -f "$$ANSWER_PATH" ]; then
+    echo "ERROR: answer.json not found at $$ANSWER_PATH"
+    echo "0.0" > "$$REWARD_PATH"
+    exit 1
+fi
+
+# Validate answer is valid JSON
+if ! python3 -c "import json; json.load(open('$$ANSWER_PATH'))" 2>/dev/null; then
+    echo "ERROR: answer.json is not valid JSON"
+    echo "0.0" > "$$REWARD_PATH"
+    exit 1
+fi
+
+echo "answer.json found and valid JSON"
+
+# Run oracle checks
+if [ ! -f "$$ORACLE_CHECKS" ]; then
+    echo "ERROR: oracle_checks.py not found at $$ORACLE_CHECKS"
+    echo "0.0" > "$$REWARD_PATH"
+    exit 1
+fi
+
+echo "Running oracle checks..."
+SCORE=$$(python3 "$$ORACLE_CHECKS" --answer "$$ANSWER_PATH" --spec "$$TASK_SPEC_PATH" --verbose 2>&1 | tee /dev/stderr | tail -1)
+
+# Validate score is a number
+if ! echo "$$SCORE" | python3 -c "import sys; float(sys.stdin.read().strip())" 2>/dev/null; then
+    echo "ERROR: oracle_checks.py did not return a valid score: $$SCORE"
+    echo "0.0" > "$$REWARD_PATH"
+    exit 1
+fi
+
+echo ""
+echo "Composite score: $$SCORE"
+echo "$$SCORE" > "$$REWARD_PATH"
+
+# Exit based on score (SWE-Factory exit-code-first pattern)
+python3 -c "import sys; sys.exit(0 if float('$$SCORE') > 0 else 1)"
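
The last two stages of the evaluator above (score validation, then the exit code) can be read as one small decision function. The following Python mirror is a sketch for reasoning about and unit-testing that rule; it is not part of the commit, and `evaluate_score_line` is a hypothetical name. It assumes, as the script does, that the score is the final line printed by oracle_checks.py.

```python
def evaluate_score_line(last_line: str) -> tuple[str, int]:
    """Return (text written to reward.txt, process exit code).

    Mirrors the bash logic: a line that does not parse as a float is a
    total failure (reward 0.0, exit 1); otherwise the line is written
    as-is and the exit code is 0 only when the score is above zero.
    """
    text = last_line.strip()
    try:
        score = float(text)
    except ValueError:
        return "0.0", 1
    return text, 0 if score > 0 else 1


print(evaluate_score_line("0.75\n"))
print(evaluate_score_line("0.0"))
print(evaluate_score_line("Traceback (most recent call last):"))
```

One consequence visible in this form: any non-numeric final line, such as a traceback merged in by `2>&1`, is scored as a total failure rather than surfaced as an evaluator error.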

templates/mcp_unique_task/instruction.md.j2

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# $task_title
+
+## Your Task
+
+$customer_prompt
+
+## Context
+
+$context_description
+
+## Available Resources
+
+$local_repo_description
+
+**Note:** Additional repositories are accessible via Sourcegraph MCP tools:
+$mcp_repos_description
+
+## Output Format
+
+Create a file at `/workspace/answer.json` with your findings in the following structure:
+
+```json
+{
+  "files": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.go"}
+  ],
+  "symbols": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.go", "symbol": "SymbolName"}
+  ],
+  "chain": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.go", "symbol": "FunctionName"}
+  ],
+  "text": "Narrative explanation of your findings, citing repos and file paths."
+}
+```
+
+Include only the fields relevant to this task. Your answer is evaluated against a closed-world oracle — completeness matters.
+
+## Evaluation
+
+Your answer will be scored on:
+$evaluation_criteria
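
The answer.json structure described in the instruction template above can be checked for shape before any scoring runs. The sketch below is an assumption-laden illustration, not the logic of oracle_checks.py (which this commit does not show): the helper name `answer_problems` is hypothetical, every top-level field is treated as optional because the template says to include only relevant fields, and the required keys per entry are taken from the example JSON.

```python
# Keys each entry is expected to carry, read off the example in the template.
ENTRY_KEYS = {
    "files": {"repo", "path"},
    "symbols": {"repo", "path", "symbol"},
    "chain": {"repo", "path", "symbol"},
}


def answer_problems(answer: dict) -> list[str]:
    """Return human-readable shape problems; an empty list means none found."""
    problems = []
    for field, required in ENTRY_KEYS.items():
        if field not in answer:
            continue  # optional field: only check it when present
        entries = answer[field]
        if not isinstance(entries, list):
            problems.append(f"{field}: expected a list")
            continue
        for index, entry in enumerate(entries):
            if not isinstance(entry, dict):
                problems.append(f"{field}[{index}]: expected an object")
                continue
            missing = required - entry.keys()
            if missing:
                problems.append(f"{field}[{index}]: missing {sorted(missing)}")
    if "text" in answer and not isinstance(answer["text"], str):
        problems.append("text: expected a string")
    return problems


good = {"files": [{"repo": "org/repo-name", "path": "relative/path/to/file.go"}]}
bad = {"symbols": [{"repo": "org/repo-name"}], "text": 42}
print(answer_problems(good))
print(answer_problems(bad))
```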

templates/mcp_unique_task/task.toml.j2

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+version = "1.0"
+
+[metadata]
+name = "$task_id"
+description = "$task_description"
+license = "Apache-2.0"
+
+[task]
+id = "$task_id"
+repo = "$primary_repo"
+category = "$task_family"
+language = "$language"
+difficulty = "$difficulty"
+time_limit_sec = $time_limit_sec
+mcp_suite = "$mcp_suite"
+use_case_id = $use_case_id
+repo_set_id = "$repo_set_id"
+mcp_unique = true
+
+[verification]
+type = "eval"
+command = "bash /tests/eval.sh"
+
+reward_type = "score"
+description = "$task_description"
+
+[environment]
+build_timeout_sec = 600.0

templates/mcp_unique_task/task_spec.json.j2

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+{
+  "id": "$task_id",
+  "family": "$task_family",
+  "use_case_id": $use_case_id,
+  "category": "$category",
+  "mcp_suite": "$mcp_suite",
+  "prd": {
+    "user_story": "$user_story",
+    "constraints": $constraints_json,
+    "success_definition": "$success_definition",
+    "seed_prompt": "$seed_prompt"
+  },
+  "artifacts": {
+    "repo_set_id": "$repo_set_id",
+    "oracle": {
+      "required_files": $required_files_json,
+      "required_symbols": $required_symbols_json,
+      "required_references": [],
+      "dependency_chains": $dependency_chains_json
+    }
+  },
+  "evaluation": {
+    "modes": $evaluation_modes_json,
+    "checks": $evaluation_checks_json,
+    "eval_script": "/tests/eval.sh",
+    "pass_exit_code": 0
+  },
+  "logging": {
+    "required_metrics": ["oracle_coverage", "time_to_first_oracle_hit_ms", "unique_repos_touched"]
+  }
+}
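
The JSON template above mixes two kinds of placeholder: string values that sit inside existing double quotes (`"$user_story"`) and raw JSON fragments (`$constraints_json`). Plain text substitution does not escape quotes or backslashes, so a string value containing `"` would produce invalid JSON. A minimal sketch of one way to prepare values follows. It is not the generator's actual code: `json_string_body` and `render_task_spec` are hypothetical names, and the template is a shortened stand-in for task_spec.json.j2.

```python
import json
from string import Template

# Shortened stand-in for task_spec.json.j2.
SPEC_TEMPLATE = Template(
    "{\n"
    '  "id": "$task_id",\n'
    '  "use_case_id": $use_case_id,\n'
    '  "prd": {\n'
    '    "user_story": "$user_story",\n'
    '    "constraints": $constraints_json\n'
    "  }\n"
    "}\n"
)


def json_string_body(value: str) -> str:
    """Escape a string for placement between existing JSON double quotes."""
    return json.dumps(value)[1:-1]  # drop the quotes json.dumps adds


def render_task_spec(
    task_id: str, use_case_id: int, user_story: str, constraints: list[str]
) -> str:
    return SPEC_TEMPLATE.substitute(
        task_id=json_string_body(task_id),
        use_case_id=int(use_case_id),
        user_story=json_string_body(user_story),
        constraints_json=json.dumps(constraints),  # raw JSON fragment
    )


rendered = render_task_spec(
    "CCX-dep-trace-001", 1, 'As a "platform" engineer...', ["Must cite file paths"]
)
spec = json.loads(rendered)  # raises ValueError if the output is not valid JSON
print(spec["prd"]["user_story"])
```

Parsing the rendered text with `json.loads` is the same kind of check the commit message reports for task_spec.json, applied here to a value that contains quotes.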
