feat: US-010 - Task generation templates

sjarmak · claude · sjarmak · commit 9f8207057b24 · 2026-02-20T20:35:41.000Z
Creates templates/mcp_unique_task/ with 7 files (6 templates plus a README)
for generating Harbor-compatible MCP-unique benchmark task directories:
- task.toml.j2: task metadata with mcp_suite, use_case_id, mcp_unique=true
- instruction.md.j2: customer-framed prompt with answer.json output format
- eval.sh.j2: exit-code-first evaluator calling oracle_checks.py
- task_spec.json.j2: full TaskSpec (prd/artifacts/evaluation/logging)
- Dockerfile.j2: baseline (clones local_checkout_repos only)
- Dockerfile.sg_only.j2: MCP-full (no clones, marks /tmp/.sg_only_mode)
- README.md: documents all template variables

All templates use Python string.Template syntax, not Jinja2, despite the
.j2 extension. Validated: all 6 content templates substitute cleanly;
task_spec.json produces valid JSON; eval.sh passes bash -n syntax check.
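
The substitution and escaping rules described above can be illustrated with a minimal sketch. The two miniature templates and `sample_vars` below are hypothetical stand-ins, not the files added by this commit; only the `string.Template` behavior is standard.

```python
import json
from string import Template

# Hypothetical miniature templates (stand-ins for eval.sh.j2 and task_spec.json.j2).
EVAL_SH = Template(
    '#!/bin/bash\n'
    'TASK_ID="$task_id"\n'
    'REWARD_PATH="/logs/verifier/reward.txt"\n'
    'echo "0.0" > "$$REWARD_PATH"\n'  # "$$" renders as a literal "$" for bash
)
TASK_SPEC = Template(
    '{"id": "$task_id", "use_case_id": $use_case_id, "constraints": $constraints_json}'
)

# Assumed sample values; JSON-array variables are passed as pre-serialized strings.
sample_vars = {
    "task_id": "CCX-dep-trace-001",
    "use_case_id": 1,
    "constraints_json": json.dumps(["Must cite file paths"]),
}

rendered_sh = EVAL_SH.substitute(sample_vars)            # extra keys are ignored
rendered_spec = json.loads(TASK_SPEC.substitute(sample_vars))  # raises if invalid JSON

# substitute() raises KeyError for the first placeholder without a value.
try:
    TASK_SPEC.substitute({"task_id": "CCX-dep-trace-001"})
    missing = None
except KeyError as exc:
    missing = exc.args[0]
```

Because string values are pasted into the JSON template verbatim, a value containing a double quote would break the rendered file; serializing such values with `json.dumps` before substitution is one way to guard against that.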

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
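
The progress notes in this commit record that test sample variables must cover every placeholder across all templates under test, and that `category` is easy to miss. A minimal sketch of a mechanical check follows; the helper names `placeholders` and `missing_vars` and the inline template strings are hypothetical, while `Template.pattern` is the standard compiled regex exposed by `string.Template`.

```python
from string import Template


def placeholders(text: str) -> set[str]:
    """Return the names of all $name / ${name} placeholders in text."""
    names = set()
    for match in Template.pattern.finditer(text):
        name = match.group("named") or match.group("braced")
        if name:  # "$$" escapes and invalid "$" uses have no name
            names.add(name)
    return names


def missing_vars(templates: dict[str, str], sample_vars: dict) -> dict[str, set[str]]:
    """Map each template name to the placeholders that sample_vars lacks."""
    report = {}
    for name, text in templates.items():
        gaps = placeholders(text) - set(sample_vars)
        if gaps:
            report[name] = gaps
    return report


# Hypothetical template snippets for illustration only.
templates = {
    "eval.sh.j2": 'echo "=== $task_id evaluator ==="\necho "$$SCORE"',
    "task_spec.json.j2": '{"id": "$task_id", "category": "${category}"}',
}
sample_vars = {"task_id": "CCX-dep-trace-001", "task_family": "cross-repo-dep-trace"}

report = missing_vars(templates, sample_vars)
```

Here `report` flags only `task_spec.json.j2`, since `$$SCORE` is an escape rather than a placeholder and `task_family` does not satisfy `category`.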
diff --git a/ralph-mcp-unique/prd.json b/ralph-mcp-unique/prd.json
@@ -238,7 +238,7 @@
         "All templates produce valid output when filled with example data"
       ],
       "priority": 10,
-      "passes": false,
+      "passes": true,
       "notes": "Suite directory is benchmarks/<mcp_suite>/<task_slug>/. Template eval.sh must handle sg_only mode guard and reward.txt path correctly."
     },
     {
diff --git a/ralph-mcp-unique/progress.txt b/ralph-mcp-unique/progress.txt
@@ -160,6 +160,27 @@
 [2026-02-20 20:21:26 UTC] Iteration 1 complete
 [2026-02-20 20:21:28 UTC] Iteration 2 started
 
+## 2026-02-20 - US-010: Task generation templates
+- Created `templates/mcp_unique_task/` with 7 files:
+  - `task.toml.j2`: task metadata with mcp_suite, use_case_id, repo_set_id, mcp_unique=true
+  - `instruction.md.j2`: customer-framed prompt with explicit answer.json output format
+  - `eval.sh.j2`: exit-code-first evaluator: calls oracle_checks.py, writes /logs/verifier/reward.txt, handles sg_only mode guard
+  - `task_spec.json.j2`: full TaskSpec with prd/artifacts/evaluation/logging sections
+  - `Dockerfile.j2`: baseline variant (clones local_checkout_repos)
+  - `Dockerfile.sg_only.j2`: MCP-full variant (no clones, marks /tmp/.sg_only_mode)
+  - `README.md`: documents all template variables with type/example/description table
+- All templates use Python string.Template syntax ($variable / ${variable}); bash $$ for literal dollar signs
+- Validation: all 6 templates substituted successfully with sample variables
+- task_spec.json.j2 validated as JSON after substitution; eval.sh.j2 passed bash -n
+- Files changed: `templates/mcp_unique_task/` (7 new files)
+- **Learnings for future iterations:**
+  - string.Template uses $variable syntax; bash code uses $$ to escape literal $
+  - Bash heredocs with $VAR in templates need $$VAR to avoid Template substitution
+  - task_spec.json.j2: "category" variable is separate from "task_family" — easy to miss
+  - eval.sh.j2 uses the $$(python3 ...) pattern so the generated eval.sh contains bash command substitution $(python3 ...)
+  - Test sample_vars must include ALL variables used across ALL templates being tested
+---
+
 ## 2026-02-20 - US-009: Repo-set fixtures for starter pack
 - Created 5 repo-set fixtures in `fixtures/repo_sets/`:
   1. `kubernetes-ecosystem.json`: kubernetes/kubernetes (local) + sg-benchmarks/kubernetes-client-go + sg-benchmarks/kubernetes-api + etcd-io/etcd (all mcp_only). cross_org=true, 4 repos, Go.
diff --git a/templates/mcp_unique_task/Dockerfile.j2 b/templates/mcp_unique_task/Dockerfile.j2
@@ -0,0 +1,27 @@
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Base tools
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    curl \
+    python3 \
+    $language_packages \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Clone local checkout repos (baseline config: agent has local access to these)
+$clone_commands
+
+# Initialize git identity for agent commits
+RUN git config --global user.email "agent@example.com" && \
+    git config --global user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Create log directories
+RUN mkdir -p /logs/agent /logs/verifier
+
+ENTRYPOINT []
diff --git a/templates/mcp_unique_task/Dockerfile.sg_only.j2 b/templates/mcp_unique_task/Dockerfile.sg_only.j2
@@ -0,0 +1,30 @@
+# $task_id — sg_only variant
+# No local repo clone — agent uses Sourcegraph MCP exclusively for code access.
+# The verifier restores the full repo from /repo_full/ before scoring.
+
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    python3 \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Empty workspace — agent discovers code via MCP tools only
+RUN git init && \
+    git config user.email "agent@example.com" && \
+    git config user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Create log directories
+RUN mkdir -p /logs/agent /logs/verifier
+
+# Mark sg_only mode — verifiers and eval scripts check this flag
+RUN touch /tmp/.sg_only_mode
+
+ENTRYPOINT []
diff --git a/templates/mcp_unique_task/README.md b/templates/mcp_unique_task/README.md
@@ -0,0 +1,111 @@
+# MCP-Unique Task Templates
+
+Templates for generating Harbor-compatible MCP-unique benchmark task directories.
+All templates use Python `string.Template` syntax: `$variable` or `${variable}`.
+Literal `$` signs in bash code are escaped as `$$`.
+
+## Template Files
+
+| File | Purpose |
+|------|---------|
+| `task.toml.j2` | Harbor task metadata and verification config |
+| `instruction.md.j2` | Agent instruction with customer-framed prompt |
+| `eval.sh.j2` | Exit-code-first evaluator calling oracle_checks.py |
+| `task_spec.json.j2` | Full TaskSpec (oracle + evaluation checks) |
+| `Dockerfile.j2` | Baseline: clones local_checkout_repos |
+| `Dockerfile.sg_only.j2` | MCP-Full: no clone, marks /tmp/.sg_only_mode |
+
+## Template Variables
+
+### task.toml.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_id` | string | `CCX-dep-trace-001` | Task identifier (CCX-<family>-<NNN>) |
+| `$task_description` | string | `Trace blast radius of client-go changes` | Short description |
+| `$primary_repo` | string | `kubernetes/kubernetes` | Main local repo (org/name) |
+| `$task_family` | string | `cross-repo-dep-trace` | Task family ID from registry |
+| `$language` | string | `go` | Primary programming language |
+| `$difficulty` | string | `medium` | Task difficulty: easy/medium/hard |
+| `$time_limit_sec` | int | `900` | Agent time limit in seconds |
+| `$mcp_suite` | string | `ccb_mcp_crossrepo_tracing` | CCB MCP suite name |
+| `$use_case_id` | int | `1` | Use case ID from registry (1-100) |
+| `$repo_set_id` | string | `kubernetes-ecosystem` | Fixture ID |
+
+### instruction.md.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_title` | string | `Kubernetes Dependency Blast Radius` | Human-readable title |
+| `$customer_prompt` | string | `Find all repos that import...` | Customer-framed task prompt |
+| `$context_description` | string | `You are a platform engineer...` | Background/role context |
+| `$local_repo_description` | string | `The local /workspace contains kubernetes/kubernetes` | What's available locally |
+| `$mcp_repos_description` | string | `- sg-benchmarks/kubernetes-client-go...` | MCP-only repos bullet list |
+| `$evaluation_criteria` | string | `- Recall of affected repos...` | What scoring checks |
+
+### eval.sh.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_id` | string | `CCX-dep-trace-001` | Task ID for logging |
+
+### task_spec.json.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_id` | string | `CCX-dep-trace-001` | Task identifier |
+| `$task_family` | string | `cross-repo-dep-trace` | Task family |
+| `$use_case_id` | int | `1` | Use case ID |
+| `$category` | string | `A` | Category A-J |
+| `$mcp_suite` | string | `ccb_mcp_crossrepo_tracing` | Suite name |
+| `$user_story` | string | `As a platform engineer...` | PRD user story |
+| `$constraints_json` | JSON array | `["Must cite file paths"]` | Constraint list as JSON string |
+| `$success_definition` | string | `Agent identifies all affected repos` | Success criteria |
+| `$seed_prompt` | string | `Find all repos importing...` | Curation seed prompt |
+| `$repo_set_id` | string | `kubernetes-ecosystem` | Fixture ID |
+| `$required_files_json` | JSON array | `[]` | Oracle file list (empty until curated) |
+| `$required_symbols_json` | JSON array | `[]` | Oracle symbol list (empty until curated) |
+| `$dependency_chains_json` | JSON array | `[]` | Oracle chains (empty until curated) |
+| `$evaluation_modes_json` | JSON array | `["deterministic"]` | Evaluation modes |
+| `$evaluation_checks_json` | JSON array | `[{"type":"file_set_match",...}]` | Check configurations |
+
+### Dockerfile.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$language_packages` | string | `golang-go` | apt packages for the language |
+| `$clone_commands` | string | `RUN git clone...` | Shell commands to clone local repos |
+
+### Dockerfile.sg_only.j2
+
+| Variable | Type | Example | Description |
+|----------|------|---------|-------------|
+| `$task_id` | string | `CCX-dep-trace-001` | Task ID for Dockerfile comment |
+
+## Generation
+
+Templates are filled by `scripts/generate_mcp_unique_tasks.py`:
+
+```bash
+python3 scripts/generate_mcp_unique_tasks.py --use-case-ids 1 --out benchmarks/
+python3 scripts/generate_mcp_unique_tasks.py --category A --dry-run
+python3 scripts/generate_mcp_unique_tasks.py --all --curate-oracle
+```
+
+## Layout
+
+Generated task directories follow:
+
+```
+benchmarks/<mcp_suite>/<task_slug>/
+├── environment/
+│   ├── Dockerfile        (baseline: clones local repos)
+│   └── Dockerfile.sg_only
+├── tests/
+│   ├── eval.sh
+│   ├── oracle_checks.py  (copied from scripts/ccb_metrics/oracle_checks.py)
+│   ├── task_spec.json
+│   └── oracle_answer.json (populated by curate_oracle.py)
+├── task.toml
+└── instruction.md
+```
diff --git a/templates/mcp_unique_task/eval.sh.j2 b/templates/mcp_unique_task/eval.sh.j2
@@ -0,0 +1,68 @@
+#!/bin/bash
+# eval.sh — MCP-unique benchmark evaluator for $task_id
+# Exit-code-first (SWE-Factory pattern):
+#   exit 0 — agent produced useful output (composite score > 0)
+#   exit 1 — total failure (composite score == 0 or missing answer)
+#
+# Writes /logs/verifier/reward.txt with the composite score [0.0, 1.0]
+
+set -euo pipefail
+
+TASK_ID="$task_id"
+ANSWER_PATH="/workspace/answer.json"
+TASK_SPEC_PATH="/tests/task_spec.json"
+ORACLE_CHECKS="/tests/oracle_checks.py"
+REWARD_PATH="/logs/verifier/reward.txt"
+
+mkdir -p /logs/verifier
+
+echo "=== $task_id evaluator ==="
+echo "Task spec: $$TASK_SPEC_PATH"
+echo "Answer:    $$ANSWER_PATH"
+echo ""
+
+# sg_only mode guard: restore full repo if verifier wrapper exists
+if [ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ]; then
+    echo "sg_only mode: sourcing verifier wrapper..."
+    source /tests/sgonly_verifier_wrapper.sh
+fi
+
+# Verify answer file exists
+if [ ! -f "$$ANSWER_PATH" ]; then
+    echo "ERROR: answer.json not found at $$ANSWER_PATH"
+    echo "0.0" > "$$REWARD_PATH"
+    exit 1
+fi
+
+# Validate answer is valid JSON
+if ! python3 -c "import json; json.load(open('$$ANSWER_PATH'))" 2>/dev/null; then
+    echo "ERROR: answer.json is not valid JSON"
+    echo "0.0" > "$$REWARD_PATH"
+    exit 1
+fi
+
+echo "answer.json found and valid JSON"
+
+# Run oracle checks
+if [ ! -f "$$ORACLE_CHECKS" ]; then
+    echo "ERROR: oracle_checks.py not found at $$ORACLE_CHECKS"
+    echo "0.0" > "$$REWARD_PATH"
+    exit 1
+fi
+
+echo "Running oracle checks..."
+SCORE=$$(python3 "$$ORACLE_CHECKS" --answer "$$ANSWER_PATH" --spec "$$TASK_SPEC_PATH" --verbose 2>&1 | tee /dev/stderr | tail -1) || true
+
+# Validate score is a number
+if ! echo "$$SCORE" | python3 -c "import sys; float(sys.stdin.read().strip())" 2>/dev/null; then
+    echo "ERROR: oracle_checks.py did not return a valid score: $$SCORE"
+    echo "0.0" > "$$REWARD_PATH"
+    exit 1
+fi
+
+echo ""
+echo "Composite score: $$SCORE"
+echo "$$SCORE" > "$$REWARD_PATH"
+
+# Exit based on score (SWE-Factory exit-code-first pattern)
+python3 -c "import sys; sys.exit(0 if float('$$SCORE') > 0 else 1)"
diff --git a/templates/mcp_unique_task/instruction.md.j2 b/templates/mcp_unique_task/instruction.md.j2
@@ -0,0 +1,42 @@
+# $task_title
+
+## Your Task
+
+$customer_prompt
+
+## Context
+
+$context_description
+
+## Available Resources
+
+$local_repo_description
+
+**Note:** Additional repositories are accessible via Sourcegraph MCP tools:
+$mcp_repos_description
+
+## Output Format
+
+Create a file at `/workspace/answer.json` with your findings in the following structure:
+
+```json
+{
+  "files": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.go"}
+  ],
+  "symbols": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.go", "symbol": "SymbolName"}
+  ],
+  "chain": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.go", "symbol": "FunctionName"}
+  ],
+  "text": "Narrative explanation of your findings, citing repos and file paths."
+}
+```
+
+Include only the fields relevant to this task. Your answer is evaluated against a closed-world oracle — completeness matters.
+
+## Evaluation
+
+Your answer will be scored on:
+$evaluation_criteria
diff --git a/templates/mcp_unique_task/task.toml.j2 b/templates/mcp_unique_task/task.toml.j2
@@ -0,0 +1,28 @@
+version = "1.0"
+
+[metadata]
+name = "$task_id"
+description = "$task_description"
+license = "Apache-2.0"
+
+[task]
+id = "$task_id"
+repo = "$primary_repo"
+category = "$task_family"
+language = "$language"
+difficulty = "$difficulty"
+time_limit_sec = $time_limit_sec
+mcp_suite = "$mcp_suite"
+use_case_id = $use_case_id
+repo_set_id = "$repo_set_id"
+mcp_unique = true
+
+[verification]
+type = "eval"
+command = "bash /tests/eval.sh"
+
+reward_type = "score"
+description = "$task_description"
+
+[environment]
+build_timeout_sec = 600.0
diff --git a/templates/mcp_unique_task/task_spec.json.j2 b/templates/mcp_unique_task/task_spec.json.j2
@@ -0,0 +1,31 @@
+{
+  "id": "$task_id",
+  "family": "$task_family",
+  "use_case_id": $use_case_id,
+  "category": "$category",
+  "mcp_suite": "$mcp_suite",
+  "prd": {
+    "user_story": "$user_story",
+    "constraints": $constraints_json,
+    "success_definition": "$success_definition",
+    "seed_prompt": "$seed_prompt"
+  },
+  "artifacts": {
+    "repo_set_id": "$repo_set_id",
+    "oracle": {
+      "required_files": $required_files_json,
+      "required_symbols": $required_symbols_json,
+      "required_references": [],
+      "dependency_chains": $dependency_chains_json
+    }
+  },
+  "evaluation": {
+    "modes": $evaluation_modes_json,
+    "checks": $evaluation_checks_json,
+    "eval_script": "/tests/eval.sh",
+    "pass_exit_code": 0
+  },
+  "logging": {
+    "required_metrics": ["oracle_coverage", "time_to_first_oracle_hit_ms", "unique_repos_touched"]
+  }
+}
