Commit be5a5a0

update README.md
1 parent 83a2a9d commit be5a5a0

1 file changed

Lines changed: 30 additions & 49 deletions

tests/benchmarks/mcp_universe/README.md

@@ -4,9 +4,9 @@ This directory contains the integration of the MCP-Universe repository managemen

 ## Overview

-MCP-Universe is a comprehensive benchmark from Salesforce AI Research that evaluates LLMs on realistic tasks using real-world MCP servers. This integration focuses on the **repository management domain** with:
+MCP-Universe is a comprehensive benchmark from Salesforce AI Research that evaluates LLMs on realistic tasks using real-world MCP servers. This integration focuses on the repository management domain with:

-- **28 pure GitHub tasks** (github_task_0001 through github_task_0030, excluding 0013 and 0020)
+- 28 GitHub tasks
 - Tests realistic GitHub operations including:
 - Creating repositories and branches
 - Managing files and commits
@@ -18,22 +18,21 @@ MCP-Universe is a comprehensive benchmark from Salesforce AI Research that evalu

 ### Prerequisites

-1. **Docker** - Required to run the GitHub MCP server
+1. Docker - Required to run the GitHub MCP server
 - Install Docker Desktop: https://www.docker.com/products/docker-desktop
-- **Start Docker Desktop** before running tests
+- Start Docker Desktop before running tests
 - Verify: `docker --version`
 - Using pinned version v0.15.0 for research reproducibility
+- If you have multiple versions of the GitHub MCP server image, ensure v0.15.0 is tagged as `latest` or is the only version installed

-2. **GitHub Personal Access Token** - For GitHub API access
-- **CRITICAL**: Use a dedicated test GitHub account for safety
+2. GitHub Personal Access Token - For GitHub API access
+- Use a test GitHub account for safety
 - Create token: https://github.com/settings/tokens
-- Required scopes: `repo`, `delete_repo`
+- Required scopes: All scopes

-3. **LLM API Key**
-- OpenAI API key for GPT models, OR
-- Anthropic API key for Claude models
+3. LLM API Key

-4. **Python 3.13+** with [uv](https://docs.astral.sh/uv/)
+4. Python 3.13+ with [uv](https://docs.astral.sh/uv/)

 ### Installation

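Note on the Docker prerequisite in the hunk above: the new bullet asks readers to make sure v0.15.0 is the only local version of the GitHub MCP server image (or is tagged `latest`). One way to check this is sketched below in Python. The helper names `local_image_tags` and `unexpected_tags` are hypothetical and not part of the repository; the sketch assumes the `docker` CLI is on `PATH` and only inspects tag names, so it cannot confirm that a local `latest` tag actually points at v0.15.0.

```python
import shutil
import subprocess

# Image name and pinned tag as stated in the README diff.
IMAGE_REPO = "ghcr.io/github/github-mcp-server"
PINNED_TAG = "v0.15.0"


def unexpected_tags(tags, pinned=PINNED_TAG):
    """Return local tags other than the pinned one, `latest`, or untagged images."""
    ignored = {pinned, "latest", "<none>"}
    return sorted({t for t in tags if t and t not in ignored})


def local_image_tags(repo=IMAGE_REPO):
    """List local tags for `repo` by calling the Docker CLI (hypothetical helper)."""
    if shutil.which("docker") is None:
        raise RuntimeError("docker CLI not found on PATH")
    result = subprocess.run(
        ["docker", "image", "ls", "--format", "{{.Tag}}", repo],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


def report():
    """Print a short summary; call this manually on a machine with Docker running."""
    tags = local_image_tags()
    if PINNED_TAG not in tags:
        print(f"Pinned image missing; run: docker pull {IMAGE_REPO}:{PINNED_TAG}")
    extra = unexpected_tags(tags)
    if extra:
        print(f"Other versions present locally: {', '.join(extra)}")
```

Calling `report()` before a benchmark run would surface stray image versions early; the tag filtering is kept in a separate pure function so it can be exercised without Docker.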
@@ -51,37 +50,32 @@ docker pull ghcr.io/github/github-mcp-server:v0.15.0

 ### Environment Variables

-**Required** - tests will fail without these:
-
 ```bash
+# Required; tests will fail without these
+# Use a test GitHub account; the agent performs real operations
 export GITHUB_PERSONAL_ACCESS_TOKEN="your_github_token"
 export GITHUB_PERSONAL_ACCOUNT_NAME="your_github_username"
-```
-
-**LLM API Key** - one of these depending on model:

-```bash
-# For OpenAI models (gpt-4o, gpt-4o-mini, etc.)
+# LLM API Key
+# For OpenAI models (gpt-5, gpt-4o, gpt-4o-mini, etc.)
 export OPENAI_API_KEY="your_openai_key"

 # For Anthropic models (claude-sonnet-4-5, etc.)
 export ANTHROPIC_API_KEY="your_anthropic_key"
 ```

-**IMPORTANT**: Use a dedicated test GitHub account. The agent performs real operations including creating and deleting repositories.
-
 ### Running Tests

 Run all 28 tasks:

 ```bash
-pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model gpt-4o-mini -v
+pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model gpt-5 -v
 ```

 Run a single task:

 ```bash
-pytest tests/benchmarks/mcp_universe/test_mcp_universe.py::test_mcp_universe[github_task_0001] --model gpt-4o-mini -v
+pytest tests/benchmarks/mcp_universe/test_mcp_universe.py::test_mcp_universe[github_task_0001] --model gpt-5 -v
 ```

 Run with different models:
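Note on the environment variables in the hunk above: the README says tests fail without the two GitHub variables, and the LLM key depends on the model. A preflight check along those lines might look like the sketch below. The function names and the prefix-based mapping from model name to API key are assumptions made for illustration; the repository's actual model routing is not shown in this diff.

```python
import os

# Variables the README marks as required for every run.
REQUIRED_VARS = ("GITHUB_PERSONAL_ACCESS_TOKEN", "GITHUB_PERSONAL_ACCOUNT_NAME")


def llm_key_for(model):
    """Guess the API key variable from the model name (assumed prefix convention)."""
    if model.startswith("claude"):
        return "ANTHROPIC_API_KEY"
    if model.startswith("gpt"):
        return "OPENAI_API_KEY"
    raise ValueError(f"no known API key variable for model {model!r}")


def missing_env_vars(model, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    needed = REQUIRED_VARS + (llm_key_for(model),)
    return [name for name in needed if not env.get(name)]
```

For example, `missing_env_vars("gpt-5")` returns an empty list only when both GitHub variables and `OPENAI_API_KEY` are set in the current environment.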
@@ -90,7 +84,7 @@ Run with different models:
 # GPT-4o
 pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model gpt-4o

-# Claude Sonnet
+# Claude Sonnet 4.5
 pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model claude-sonnet-4-5
 ```

@@ -102,6 +96,19 @@ pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model claude-sonnet-
 | `--temperature` | `0.001` | Temperature for LLM sampling |
 | `--output-dir` | `outputs` | Base directory for outputs (logs written to `{output_dir}/raw/`) |
 | `--validate-only` | - | Skip agent execution, only run evaluation against live GitHub |
+| `--toolset` | `full` | Tool availability: `full` (all 93 tools) or `minimal` (19 essential tools) |
+
+### Toolset Comparison
+
+The `--toolset` flag allows comparing agent performance with different tool availability:
+
+```bash
+# Full toolset (default): All 93 GitHub MCP tools
+pytest tests/benchmarks/mcp_universe/ --toolset full -v
+
+# Minimal toolset: 19 essential tools derived from successful test runs
+pytest tests/benchmarks/mcp_universe/ --toolset minimal -v
+```

 ### Validate Mode

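Note on the `--toolset` flag added in the hunk above: the diff documents the flag but not how it is wired up. A flag like this is typically registered through pytest's `pytest_addoption` hook in a `conftest.py`. The sketch below shows one plausible shape, assuming a `select_tools` helper (hypothetical, not taken from the repository) that filters the server's tool list against an allowlist.

```python
TOOLSETS = ("full", "minimal")


def pytest_addoption(parser):
    """Pytest hook: register a --toolset flag with the documented default."""
    parser.addoption(
        "--toolset",
        action="store",
        default="full",
        choices=TOOLSETS,
        help="Tool availability: full (all tools) or minimal (essential tools)",
    )


def select_tools(all_tools, minimal_tools, toolset):
    """Return the tool names exposed to the agent for the chosen toolset.

    `all_tools` and `minimal_tools` are supplied by the caller; the actual
    19-tool allowlist is not listed in the README diff.
    """
    if toolset == "full":
        return list(all_tools)
    if toolset == "minimal":
        allowed = set(minimal_tools)
        # Preserve the server's ordering while dropping non-essential tools.
        return [tool for tool in all_tools if tool in allowed]
    raise ValueError(f"unknown toolset: {toolset!r}")
```

A test could then read the value with `request.config.getoption("--toolset")` and pass it to `select_tools` before starting the agent.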
@@ -127,36 +134,10 @@ This is useful if you previously ran the agent and want to re-check the GitHub s
 | `instruction.txt` | System instruction for the agent |
 | `reporting.py` | Human-readable log formatting |

-### Environment Variables
-
-| Variable | Used By | Purpose |
-|----------|---------|---------|
-| `GITHUB_PERSONAL_ACCESS_TOKEN` | MCP Server, Evaluator | GitHub API authentication |
-| `GITHUB_PERSONAL_ACCOUNT_NAME` | Evaluator | Template substitution in task assertions |
-| `OPENAI_API_KEY` | FastAgent | OpenAI model access |
-| `ANTHROPIC_API_KEY` | FastAgent | Anthropic model access |
-
 ### MCP Server Configuration

 The GitHub MCP server runs in Docker:
 - Image: `ghcr.io/github/github-mcp-server:v0.15.0`
 - Required env var: `GITHUB_PERSONAL_ACCESS_TOKEN`

 Only the access token is passed to the Docker container. The account name is used locally by the evaluator for template substitution in task assertions (e.g., checking `{{GITHUB_PERSONAL_ACCOUNT_NAME}}/repo-name` exists).
-
-## Troubleshooting
-
-### "Docker not found"
-Ensure Docker Desktop is running and restart your terminal.
-
-### "GITHUB_PERSONAL_ACCESS_TOKEN environment variable not set"
-Export the required environment variables before running tests.
-
-### "repository doesn't exist" (false negative)
-GitHub's search API has indexing delays for newly created repos. The evaluator patches handle this with direct API calls, but occasional failures may occur.
-
-### Rate limiting
-If you hit GitHub API rate limits, wait a few minutes or use a token with higher limits.
-
-### Tests pass but some checks fail
-Review the `*_readable.log` files in the output directory for detailed execution traces.
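
Note on the MCP Server Configuration section kept as context in the last hunk: it says only the access token reaches the container, while the account name is substituted locally into task assertions. The sketch below illustrates that split. Both function names are hypothetical, and the `docker run` arguments are an assumption about how the server might be launched rather than the integration's actual configuration.

```python
# Image reference as stated in the README.
IMAGE = "ghcr.io/github/github-mcp-server:v0.15.0"


def docker_command(image=IMAGE):
    """Build a `docker run` argv that forwards only the access token.

    `-e NAME` without a value copies NAME from the caller's environment,
    so the token itself never appears on the command line.
    """
    return [
        "docker", "run", "-i", "--rm",
        "-e", "GITHUB_PERSONAL_ACCESS_TOKEN",
        image,
    ]


def render_assertion(template, account_name):
    """Fill the account-name placeholder used in task assertions."""
    return template.replace("{{GITHUB_PERSONAL_ACCOUNT_NAME}}", account_name)
```

With an account named `octocat`, `render_assertion("{{GITHUB_PERSONAL_ACCOUNT_NAME}}/repo-name", "octocat")` yields `octocat/repo-name`, which the evaluator could then look up through the GitHub API.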