Commit acfde6f

Add first batch of 50 easy tasks across services (#225)
1 parent c4e8b88 commit acfde6f

590 files changed

Lines changed: 10069 additions & 43 deletions

README.md

Lines changed: 10 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -85,14 +85,22 @@ python -m pipeline \
8585
--k 1 \ # run once to quick start
8686
--models gpt-5 \ # or any model you configured
8787
--tasks file_property/size_classification
88+
# Add --task-suite easy to run the lightweight dataset (where available)
8889
```
8990

90-
Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...`).
91+
Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` for the standard suite and `./results/{exp_name}/{model}__{mcp}-easy/run-*/...` when you run `--task-suite easy` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...` or `./results/test-run/gpt-5__github-easy/run-1/...`).
9192

9293
---
9394

9495
## Run your evaluations
9596

97+
### Task suites (standard vs easy)
98+
99+
- Each MCP service now stores tasks under `tasks/<mcp>/<task_suite>/<category>/<task>/`.
100+
- `standard` (default) covers the full benchmark (127 tasks today).
101+
- `easy` hosts 10 lightweight tasks per MCP, ideal for smoke tests and CI (GitHub’s are already available under `tasks/github/easy`).
102+
- Switch suites with `--task-suite easy` (defaults to `--task-suite standard`).
103+
96104
### Single run (k=1)
97105
```bash
98106
# Run ALL tasks for a service
@@ -173,7 +181,7 @@ python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-mo
173181
## Contributing
174182

175183
Contributions are welcome:
176-
1. Add a new task under `tasks/<category_id>/<task_id>/` with `meta.json`, `description.md` and `verify.py`.
184+
1. Add a new task under `tasks/<mcp>/<task_suite>/<category_id>/<task_id>/` with `meta.json`, `description.md` and `verify.py`.
177185
2. Ensure local checks pass and open a PR.
178186
3. See `docs/contributing/make-contribution.md`.
179187
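
The layout described in the README changes above can be sketched in a few lines. This is a minimal illustration, not code from the commit: `task_dir` and `TASK_SUITES` are hypothetical names, and the category/task pairing in the second call is a placeholder.

```python
from pathlib import PurePosixPath

# Suites named in the README; assumed here to be the only valid values.
TASK_SUITES = ("standard", "easy")


def task_dir(mcp: str, category: str, task: str, task_suite: str = "standard") -> PurePosixPath:
    """Build tasks/<mcp>/<task_suite>/<category>/<task> (hypothetical helper)."""
    if task_suite not in TASK_SUITES:
        raise ValueError(f"unknown task suite: {task_suite!r}")
    return PurePosixPath("tasks") / mcp / task_suite / category / task


# Default suite is "standard", mirroring the CLI default.
print(task_dir("filesystem", "file_property", "size_classification"))
# Placeholder category and task names for the easy suite.
print(task_dir("github", "some_category", "some_task", task_suite="easy"))
```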

docs/contributing/make-contribution.md

Lines changed: 2 additions & 2 deletions
@@ -2,8 +2,8 @@
22

33
1. Fork the repository and create a feature branch.
44

5-
2. Add new tasks under `tasks/<category>/<task_n>/` with the files of `meta.json`, `description.md` and `verify.py`. Please refer to [Task Page](../datasets/task.md) for detailed instructions.
5+
2. Add new tasks under `tasks/<mcp>/<task_suite>/<category>/<task_id>/` with the files of `meta.json`, `description.md` and `verify.py`. Please refer to [Task Page](../datasets/task.md) for detailed instructions.
66

77
3. Ensure all tests pass.
88

9-
4. Submit a pull request — contributions are welcome!
9+
4. Submit a pull request — contributions are welcome!

docs/datasets/task.md

Lines changed: 9 additions & 7 deletions
@@ -18,15 +18,17 @@ tasks
1818
1919
└───filesystem
2020
21-
└───file_context
21+
└───standard # task_suite (also supports `easy`)
2222
23-
└───create_file_write
24-
│ meta.json
25-
│ description.md
26-
│ verify.py
23+
└───file_context # category_id
24+
25+
└───create_file_write
26+
│ meta.json
27+
│ description.md
28+
│ verify.py
2729
```
2830

29-
Note that all tasks are placed under `tasks/`. `filesystem` refers to the environment for the MCP service.
31+
All tasks live under `tasks/<mcp>/<task_suite>/<category>/<task_id>/`. `filesystem` refers to the MCP service and `task_suite` captures the difficulty slice (`standard` benchmark vs `easy` smoke tests).
3032

3133
`meta.json` includes the meta information about the task, including the following key
3234
- task_id: the id of the task.
@@ -68,4 +70,4 @@ Accordingly, the `verify.py` contains the following functionalities
6870
- Check whether the target directory contains the file with target file name. [![Check Target File Existence](https://i.postimg.cc/Qx0Zwnf6/task-sample-verify-file-existence.png)](https://postimg.cc/7fGRTX87)
6971
- Check whether the target file contains the desired content `EXPECTED_PATTERNS = ["Hello Wolrd"]`. [![Check Content in Target File](https://i.postimg.cc/JzzMhWyV/task-sample-verify-check-content.png)](https://postimg.cc/w7ZSWZc0)
7072

71-
- If the outcome passes **all the above verification functionalities**, the task would be marked as successfully completed.
73+
- If the outcome passes **all the above verification functionalities**, the task would be marked as successfully completed.
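
A contributor may want to confirm that a new task directory holds the three files the docs require before opening a PR. The sketch below is one way to do that; `missing_task_files` and `REQUIRED_FILES` are illustrative names and are not part of the repository.

```python
import tempfile
from pathlib import Path
from typing import List

# File names required by docs/datasets/task.md for every task directory.
REQUIRED_FILES = ("meta.json", "description.md", "verify.py")


def missing_task_files(task_dir: Path) -> List[str]:
    """Return the required file names that are absent from task_dir."""
    return [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]


# Demo on a throwaway directory that follows the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "tasks" / "filesystem" / "standard" / "file_context" / "create_file_write"
    task.mkdir(parents=True)
    (task / "meta.json").write_text("{}")
    # Only meta.json exists, so the other two names are reported.
    print(missing_task_files(task))
```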

docs/installation_and_docker_usage.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ The `run-task.sh` script provides simplified Docker usage:
4444
./run-task.sh --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK --k K
4545
```
4646

47-
where *MODEL_NAME* refers to the model choice from the supported models (see [Introduction Page](./introduction.md) for more information), *EXPNAME* refers to customized experiment name, *TASK* refers to specific task or task group (see `tasks/` for more information), *K* refers to the time of independent experiments.
47+
where *MODEL_NAME* refers to the model choice from the supported models (see [Introduction Page](./introduction.md) for more information), *EXPNAME* refers to customized experiment name, *TASK* refers to specific task or task group (see `tasks/<mcp>/<task_suite>/...` for more information), *K* refers to the time of independent experiments.
4848

4949

5050
Additionally, the `run-benchmark.sh` script evaluates models across all MCP services:

pipeline.py

Lines changed: 8 additions & 0 deletions
@@ -54,6 +54,12 @@ def main():
5454
default="all",
5555
help='Tasks to run: (1). "all"; (2). "category"; or (3). "category/task".',
5656
)
57+
parser.add_argument(
58+
"--task-suite",
59+
default="standard",
60+
choices=["standard", "easy"],
61+
help="Task suite to run (default: standard). Use 'easy' to run the lightweight dataset.",
62+
)
5763
parser.add_argument(
5864
"--exp-name",
5965
default=None,
@@ -111,6 +117,7 @@ def main():
111117

112118
logger.info("MCPMark Evaluation")
113119
logger.info(f"Experiment: {args.exp_name} | {len(model_list)} Model(s): {', '.join(model_list)}")
120+
logger.info(f"Task suite: {args.task_suite}")
114121
if args.k > 1:
115122
logger.info(f"Running {args.k} evaluation runs for pass@k metrics")
116123

@@ -147,6 +154,7 @@ def main():
147154
output_dir=run_output_dir,
148155
reasoning_effort=args.reasoning_effort,
149156
agent_name=args.agent,
157+
task_suite=args.task_suite,
150158
)
151159

152160
pipeline.run_evaluation(args.tasks)
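
The new `--task-suite` flag can be exercised in isolation. The sketch below copies only that one argument from the diff into a standalone parser; `build_parser` is a hypothetical wrapper, and the real `pipeline.py` defines many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Standalone parser holding only the --task-suite option from the diff."""
    parser = argparse.ArgumentParser(prog="pipeline")
    parser.add_argument(
        "--task-suite",
        default="standard",
        choices=["standard", "easy"],
        help="Task suite to run (default: standard). Use 'easy' to run the lightweight dataset.",
    )
    return parser


# argparse exposes the dashed flag as the attribute `task_suite`.
print(build_parser().parse_args([]).task_suite)
print(build_parser().parse_args(["--task-suite", "easy"]).task_suite)
```

Because `choices` is set, any other value makes argparse exit with a usage error rather than reaching the evaluator.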

src/aggregators/aggregate_results.py

Lines changed: 52 additions & 19 deletions
@@ -20,8 +20,12 @@
2020
from src.aggregators.pricing import compute_cost_usd
2121

2222

23-
def discover_tasks() -> Dict[str, List[str]]:
24-
"""Discover all tasks from ./tasks directory."""
23+
# Supported difficulty splits in ./tasks/<service>/<task_set>/
24+
SUPPORTED_TASK_SETS = {"standard", "easy"}
25+
26+
27+
def discover_tasks(task_set: str = "standard") -> Dict[str, List[str]]:
28+
"""Discover all tasks from ./tasks directory filtered by task set."""
2529
tasks_dir = Path("./tasks")
2630

2731
all_tasks = {}
@@ -37,22 +41,39 @@ def discover_tasks() -> Dict[str, List[str]]:
3741
}
3842

3943
for mcp_service, task_dirs in service_mappings.items():
40-
tasks = []
44+
tasks: List[str] = []
4145
for task_dir_name in task_dirs:
4246
service_path = tasks_dir / task_dir_name
4347
if not service_path.exists():
4448
continue
45-
46-
# Find all category/task combinations
47-
for category_dir in service_path.iterdir():
48-
if not category_dir.is_dir() or category_dir.name.startswith("__"):
49-
continue
50-
51-
for task_dir in category_dir.iterdir():
52-
if task_dir.is_dir():
53-
# Use unified naming for both playwright and webarena variants
54-
tasks.append(f"{category_dir.name}__{task_dir.name}")
55-
49+
50+
selected_root = service_path / task_set
51+
52+
# Detect if this service has partitioned task sets (e.g. standard/easy)
53+
has_partitioned_layout = any(
54+
child.is_dir() and child.name in SUPPORTED_TASK_SETS
55+
for child in service_path.iterdir()
56+
)
57+
58+
if selected_root.exists():
59+
search_roots = [selected_root]
60+
elif has_partitioned_layout:
61+
# Requested task set missing for this service; skip it for this run
62+
print(f" ⚠️ No '{task_set}' tasks found under {service_path}")
63+
search_roots = []
64+
else:
65+
# Legacy layout without task sets – fall back to original structure
66+
search_roots = [service_path]
67+
68+
for root in search_roots:
69+
for category_dir in root.iterdir():
70+
if not category_dir.is_dir() or category_dir.name.startswith("__"):
71+
continue
72+
73+
for task_dir in category_dir.iterdir():
74+
if task_dir.is_dir() and not task_dir.name.startswith("__"):
75+
tasks.append(f"{category_dir.name}__{task_dir.name}")
76+
5677
all_tasks[mcp_service] = sorted(tasks)
5778

5879
return all_tasks
@@ -655,14 +676,19 @@ def render_section(title: str, section_data: Dict[str, Any]) -> List[str]:
655676
f"# {exp_name} - Evaluation Results",
656677
"",
657678
f"Generated: {summary['generated_at']}",
658-
"",
659679
]
660680

681+
task_set = summary.get("task_set")
682+
if task_set:
683+
lines.append(f"Task set: {task_set}")
684+
685+
lines.append("")
686+
661687
# Overall table
662688
lines.extend(render_section("Overall Performance", summary.get("overall", {})))
663689

664690
# Service tables: infer service keys from summary
665-
reserved = {"overall", "generated_at", "k", "experiment_name"}
691+
reserved = {"overall", "generated_at", "k", "experiment_name", "task_set"}
666692
service_keys = [key for key in summary.keys() if key not in reserved]
667693
# Keep stable order
668694
for service in sorted(service_keys):
@@ -875,6 +901,12 @@ def main():
875901
type=str,
876902
help="Comma-separated list of models that only need run-1"
877903
)
904+
parser.add_argument(
905+
"--task-set",
906+
choices=sorted(SUPPORTED_TASK_SETS),
907+
default="standard",
908+
help="Which task subset to aggregate (default: standard)"
909+
)
878910
parser.add_argument("--push", action="store_true", help="Push to GitHub (default to main)")
879911

880912
args = parser.parse_args()
@@ -894,8 +926,8 @@ def main():
894926
print(f"🔄 Processing experiment: {args.exp_name}")
895927

896928
# Discover all tasks
897-
print("📋 Discovering tasks...")
898-
all_tasks = discover_tasks()
929+
print(f"📋 Discovering tasks (task set: {args.task_set})...")
930+
all_tasks = discover_tasks(args.task_set)
899931
total_tasks = sum(len(tasks) for tasks in all_tasks.values())
900932
print(f" Found {total_tasks} tasks across {len(all_tasks)} services")
901933

@@ -920,6 +952,7 @@ def main():
920952
print("\n📊 Calculating metrics...")
921953
summary = calculate_metrics(complete_models, all_tasks, args.k, single_run_models)
922954
summary["experiment_name"] = args.exp_name
955+
summary["task_set"] = args.task_set
923956

924957
# Save summary
925958
summary_path = exp_dir / "summary.json"
@@ -954,4 +987,4 @@ def main():
954987

955988

956989
if __name__ == "__main__":
957-
exit(main())
990+
exit(main())
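
The three-way branch added to `discover_tasks` is easier to follow when pulled out on its own. The sketch below restates it as a standalone function; `resolve_search_roots` is a hypothetical name, and the service and category names in the demo are placeholders.

```python
import tempfile
from pathlib import Path
from typing import List

SUPPORTED_TASK_SETS = {"standard", "easy"}


def resolve_search_roots(service_path: Path, task_set: str = "standard") -> List[Path]:
    """Pick the directories to scan for <category>/<task> folders."""
    selected_root = service_path / task_set
    if selected_root.exists():
        # Partitioned layout and the requested set is present.
        return [selected_root]
    has_partitioned_layout = any(
        child.is_dir() and child.name in SUPPORTED_TASK_SETS
        for child in service_path.iterdir()
    )
    if has_partitioned_layout:
        # The service has task sets, but not the requested one: skip it.
        return []
    # Legacy layout without task sets: scan the service directory itself.
    return [service_path]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "partitioned" / "standard" / "cat" / "task_1").mkdir(parents=True)
    (root / "legacy" / "cat" / "task_1").mkdir(parents=True)
    print(resolve_search_roots(root / "partitioned", "standard"))
    print(resolve_search_roots(root / "partitioned", "easy"))
    print(resolve_search_roots(root / "legacy", "easy"))
```

One consequence worth noting, assuming the sketch matches the diff: a service still on the legacy layout contributes all of its tasks regardless of which `--task-set` is requested.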

src/base/task_manager.py

Lines changed: 7 additions & 1 deletion
@@ -55,6 +55,7 @@ def __init__(
5555
mcp_service: str = None,
5656
task_class: type = None,
5757
task_organization: str = None,
58+
task_suite: str | None = "standard",
5859
):
5960
"""Initialize the base task manager.
6061
@@ -63,13 +64,15 @@ def __init__(
6364
mcp_service: MCP service name (e.g., 'notion', 'github', 'filesystem')
6465
task_class: Custom task class to use (defaults to BaseTask)
6566
task_organization: 'file' or 'directory' based task organization
67+
task_suite: Logical task suite (e.g., 'standard', 'easy')
6668
"""
6769
self.tasks_root = tasks_root
6870
self.mcp_service = mcp_service or self.__class__.__name__.lower().replace(
6971
"taskmanager", ""
7072
)
7173
self.task_class = task_class or BaseTask
7274
self.task_organization = task_organization
75+
self.task_suite = task_suite
7376
self._tasks_cache = None
7477

7578
# =========================================================================
@@ -85,6 +88,8 @@ def discover_all_tasks(self) -> List[BaseTask]:
8588
service_dir = self.tasks_root / (
8689
self.mcp_service or self._get_service_directory_name()
8790
)
91+
if self.task_suite:
92+
service_dir = service_dir / self.task_suite
8893

8994
if not service_dir.exists():
9095
logger.warning(
@@ -112,9 +117,10 @@ def discover_all_tasks(self) -> List[BaseTask]:
112117
# Sort by category_id and a stringified task_id to handle both numeric IDs and slugs uniformly
113118
self._tasks_cache = sorted(tasks, key=lambda t: (t.category_id, str(t.task_id)))
114119
logger.info(
115-
"Discovered %d %s tasks across all categories",
120+
"Discovered %d %s tasks across all categories (suite=%s)",
116121
len(self._tasks_cache),
117122
self.mcp_service.title(),
123+
self.task_suite or "default",
118124
)
119125
return self._tasks_cache
120126
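
Since `task_suite` is typed `str | None`, a falsy value leaves the service directory unchanged and the log line reports `default`. A minimal sketch of that behaviour, using the hypothetical helpers `suite_dir` and `suite_label` rather than the real `BaseTaskManager`:

```python
from pathlib import Path
from typing import Optional


def suite_dir(tasks_root: Path, mcp_service: str, task_suite: Optional[str] = "standard") -> Path:
    """Append the suite to the service directory only when one is set."""
    service_dir = tasks_root / mcp_service
    if task_suite:
        service_dir = service_dir / task_suite
    return service_dir


def suite_label(task_suite: Optional[str]) -> str:
    """Label used in the discovery log message."""
    return task_suite or "default"


print(suite_dir(Path("tasks"), "github", "easy"), suite_label("easy"))
print(suite_dir(Path("tasks"), "github", None), suite_label(None))
```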

src/evaluator.py

Lines changed: 8 additions & 2 deletions
@@ -27,11 +27,13 @@ def __init__(
2727
output_dir: Path = None,
2828
reasoning_effort: str = "default",
2929
agent_name: str = "mcpmark",
30+
task_suite: str = "standard",
3031
):
3132
# Main configuration
3233
self.mcp_service = mcp_service
3334
self.timeout = timeout
3435
self.agent_name = (agent_name or "mcpmark").lower()
36+
self.task_suite = (task_suite or "standard").lower()
3537
if self.agent_name not in AGENT_REGISTRY:
3638
raise ValueError(f"Unsupported agent '{agent_name}'. Available: {sorted(AGENT_REGISTRY)}")
3739

@@ -48,7 +50,9 @@ def __init__(
4850
self.litellm_run_model_name = None
4951

5052
# Initialize managers using the factory pattern (simplified)
51-
self.task_manager = MCPServiceFactory.create_task_manager(mcp_service)
53+
self.task_manager = MCPServiceFactory.create_task_manager(
54+
mcp_service, task_suite=self.task_suite
55+
)
5256
self.state_manager = MCPServiceFactory.create_state_manager(mcp_service)
5357

5458
# Obtain static service configuration from state manager (e.g., notion_key)
@@ -80,7 +84,9 @@ def __init__(
8084
model_slug = self.model_name.replace(".", "-")
8185

8286
service_for_dir = "playwright" if mcp_service == "playwright_webarena" else mcp_service
83-
self.base_experiment_dir = output_dir / f"{model_slug}__{service_for_dir}" / exp_name
87+
suite_suffix = "" if self.task_suite in ("standard", "", None) else f"-{self.task_suite}"
88+
service_dir_name = f"{service_for_dir}{suite_suffix}"
89+
self.base_experiment_dir = output_dir / f"{model_slug}__{service_dir_name}" / exp_name
8490
self.base_experiment_dir.mkdir(parents=True, exist_ok=True)
8591

8692
def _format_duration(self, seconds: float) -> str:
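
The directory naming introduced in this hunk can be checked without constructing an evaluator. The function below is a sketch that mirrors the suffix logic from the diff; `results_dir_name` is a hypothetical name, `gpt-4.1` is only an illustrative model string, and any model-name processing that happens earlier in `__init__` is not reproduced.

```python
def results_dir_name(model_name: str, mcp_service: str, task_suite: str = "standard") -> str:
    """Return '<model_slug>__<service>[-<suite>]' as built in the evaluator diff."""
    model_slug = model_name.replace(".", "-")
    # playwright_webarena results share the 'playwright' directory name.
    service_for_dir = "playwright" if mcp_service == "playwright_webarena" else mcp_service
    suite = (task_suite or "standard").lower()
    # The standard suite keeps the old directory name, so existing results still line up.
    suite_suffix = "" if suite == "standard" else f"-{suite}"
    return f"{model_slug}__{service_for_dir}{suite_suffix}"


print(results_dir_name("gpt-5", "filesystem"))
print(results_dir_name("gpt-5", "github", "easy"))
print(results_dir_name("gpt-4.1", "playwright_webarena", "easy"))
```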

src/mcp_services/filesystem/filesystem_task_manager.py

Lines changed: 2 additions & 1 deletion
@@ -30,7 +30,7 @@ class FilesystemTask(BaseTask):
3030
class FilesystemTaskManager(BaseTaskManager):
3131
"""Simplified filesystem task manager using enhanced base class."""
3232

33-
def __init__(self, tasks_root: Path = None):
33+
def __init__(self, tasks_root: Path = None, task_suite: str = "standard"):
3434
"""Initialize filesystem task manager."""
3535
if tasks_root is None:
3636
tasks_root = Path(__file__).resolve().parents[3] / "tasks"
@@ -40,6 +40,7 @@ def __init__(self, tasks_root: Path = None):
4040
mcp_service="filesystem",
4141
task_class=FilesystemTask,
4242
task_organization="directory",
43+
task_suite=task_suite,
4344
)
4445

4546
# Override only what's needed for filesystem-specific behavior

src/mcp_services/github/github_state_manager.py

Lines changed: 29 additions & 1 deletion
@@ -626,7 +626,35 @@ def _request_with_retry(
626626

627627
# Initial state for each task category is resolved via self.initial_state_mapping
628628
def select_initial_state_for_task(self, task_category: str) -> Optional[str]:
629-
return self.initial_state_mapping.get(task_category)
629+
"""Resolve template name for a task category with light normalization."""
630+
if not task_category:
631+
return None
632+
633+
candidate_keys = []
634+
candidate_keys.append(task_category)
635+
636+
# Allow users to swap between hyphen/underscore naming conventions.
637+
hyphen_to_underscore = task_category.replace("-", "_")
638+
if hyphen_to_underscore not in candidate_keys:
639+
candidate_keys.append(hyphen_to_underscore)
640+
641+
underscore_to_hyphen = task_category.replace("_", "-")
642+
if underscore_to_hyphen not in candidate_keys:
643+
candidate_keys.append(underscore_to_hyphen)
644+
645+
for key in candidate_keys:
646+
template = self.initial_state_mapping.get(key)
647+
if template:
648+
if key != task_category:
649+
logger.debug(
650+
"| Resolved GitHub template for %s via alias %s -> %s",
651+
task_category,
652+
key,
653+
template,
654+
)
655+
return template
656+
657+
return None
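
The hyphen/underscore fallback above can be tried against a plain dictionary. This is a condensed sketch of the same lookup order, not the method itself: `resolve_template` is a hypothetical free function, logging is omitted, and the mapping entries are made up for the demo.

```python
from typing import Mapping, Optional


def resolve_template(mapping: Mapping[str, str], task_category: str) -> Optional[str]:
    """Look up a category, then its underscore and hyphen variants, in that order."""
    if not task_category:
        return None
    candidates = [
        task_category,
        task_category.replace("-", "_"),
        task_category.replace("_", "-"),
    ]
    # dict.fromkeys drops duplicate candidates while preserving order.
    for key in dict.fromkeys(candidates):
        template = mapping.get(key)
        if template:
            return template
    return None


# Made-up mapping: the key uses underscores, the query uses hyphens.
demo_mapping = {"issue_triage": "template-a"}
print(resolve_template(demo_mapping, "issue-triage"))
print(resolve_template(demo_mapping, "unknown-category"))
```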
630658

631659
def extract_repo_info_from_url(self, repo_url: str) -> tuple[str, str]:
632660
"""Extract owner and repo name from GitHub URL."""
