| id | <category>_task_<N>_<short_name> |
|---|---|
| name | Human-readable task name |
| category | <category> |
| timeout_seconds | 300 |
The task instruction sent to the agent. Write it as if speaking directly to the agent.
All input files are located under /tmp_workspace/. The agent should save outputs to /tmp_workspace/results/.
Important: Do NOT use ## (level-2 headings) inside the Prompt section. The parser splits sections by ## headings, so a ## here will truncate the prompt. Use ### or lower for any sub-headings within the prompt.
Describe what a correct agent execution looks like, step by step. This section is for human readers only and is not sent to the agent.
- Criterion 1
- Criterion 2
- ...
def grade(**kwargs) -> dict:
"""
Return a dict of metric_name -> float (0.0 to 1.0).
Must include an "overall_score" key.
Runs inside the container with cwd=/tmp_workspace.
"""
scores = {}
# ... grading logic ...
scores["overall_score"] = 0.0
return scoresworkspace/<category>/task_<N>_<short_name>