update README.md

vinamra57 · vinamra57 · commit be5a5a01bc75 · 2025-12-31T15:07:54.000-08:00
diff --git a/tests/benchmarks/mcp_universe/README.md b/tests/benchmarks/mcp_universe/README.md
@@ -4,9 +4,9 @@ This directory contains the integration of the MCP-Universe repository managemen
 
 ## Overview
 
-MCP-Universe is a comprehensive benchmark from Salesforce AI Research that evaluates LLMs on realistic tasks using real-world MCP servers. This integration focuses on the **repository management domain** with:
+MCP-Universe is a comprehensive benchmark from Salesforce AI Research that evaluates LLMs on realistic tasks using real-world MCP servers. This integration focuses on the repository management domain with:
 
-- **28 pure GitHub tasks** (github_task_0001 through github_task_0030, excluding 0013 and 0020)
+- 28 GitHub tasks
 - Tests realistic GitHub operations including:
   - Creating repositories and branches
   - Managing files and commits
@@ -18,22 +18,21 @@ MCP-Universe is a comprehensive benchmark from Salesforce AI Research that evalu
 
 ### Prerequisites
 
-1. **Docker** - Required to run the GitHub MCP server
+1. Docker - Required to run the GitHub MCP server
    - Install Docker Desktop: https://www.docker.com/products/docker-desktop
-   - **Start Docker Desktop** before running tests
+   - Start Docker Desktop before running tests
    - Verify: `docker --version`
    - The GitHub MCP server image is pinned to v0.15.0 for research reproducibility
+   - If you have multiple versions of the GitHub MCP server image, ensure v0.15.0 is tagged as `latest` or is the only version installed
 
-2. **GitHub Personal Access Token** - For GitHub API access
-   - **CRITICAL**: Use a dedicated test GitHub account for safety
+2. GitHub Personal Access Token - For GitHub API access
+   - Use a test GitHub account for safety
    - Create token: https://github.com/settings/tokens
-   - Required scopes: `repo`, `delete_repo`
+   - Required scopes: All scopes
 
-3. **LLM API Key**
-   - OpenAI API key for GPT models, OR
-   - Anthropic API key for Claude models
+3. LLM API Key
 
-4. **Python 3.13+** with [uv](https://docs.astral.sh/uv/)
+4. Python 3.13+ with [uv](https://docs.astral.sh/uv/)
 
 ### Installation
 
@@ -51,37 +50,32 @@ docker pull ghcr.io/github/github-mcp-server:v0.15.0
 
 ### Environment Variables
 
-**Required** - tests will fail without these:
-
 ```bash
+# Required; tests will fail without these
+# Use a test GitHub account; the agent performs real operations
 export GITHUB_PERSONAL_ACCESS_TOKEN="your_github_token"
 export GITHUB_PERSONAL_ACCOUNT_NAME="your_github_username"
-```
-
-**LLM API Key** - one of these depending on model:
 
-```bash
-# For OpenAI models (gpt-4o, gpt-4o-mini, etc.)
+# LLM API Key
+# For OpenAI models (gpt-5, gpt-4o, gpt-4o-mini, etc.)
 export OPENAI_API_KEY="your_openai_key"
 
 # For Anthropic models (claude-sonnet-4-5, etc.)
 export ANTHROPIC_API_KEY="your_anthropic_key"
 ```
 
-**IMPORTANT**: Use a dedicated test GitHub account. The agent performs real operations including creating and deleting repositories.
-
 ### Running Tests
 
 Run all 28 tasks:
 
 ```bash
-pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model gpt-4o-mini -v
+pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model gpt-5 -v
 ```
 
 Run a single task:
 
 ```bash
-pytest tests/benchmarks/mcp_universe/test_mcp_universe.py::test_mcp_universe[github_task_0001] --model gpt-4o-mini -v
+pytest tests/benchmarks/mcp_universe/test_mcp_universe.py::test_mcp_universe[github_task_0001] --model gpt-5 -v
 ```
 
 Run with different models:
@@ -90,7 +84,7 @@ Run with different models:
 # GPT-4o
 pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model gpt-4o
 
-# Claude Sonnet
+# Claude Sonnet 4.5
 pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model claude-sonnet-4-5
 ```
 
@@ -102,6 +96,19 @@ pytest tests/benchmarks/mcp_universe/test_mcp_universe.py --model claude-sonnet-
 | `--temperature` | `0.001` | Temperature for LLM sampling |
 | `--output-dir` | `outputs` | Base directory for outputs (logs written to `{output_dir}/raw/`) |
 | `--validate-only` | - | Skip agent execution, only run evaluation against live GitHub |
+| `--toolset` | `full` | Tool availability: `full` (all 93 tools) or `minimal` (19 essential tools) |
+
+### Toolset Comparison
+
+The `--toolset` flag selects which tools the agent can use, so you can compare performance across toolsets:
+
+```bash
+# Full toolset (default): All 93 GitHub MCP tools
+pytest tests/benchmarks/mcp_universe/ --toolset full -v
+
+# Minimal toolset: 19 essential tools derived from successful test runs
+pytest tests/benchmarks/mcp_universe/ --toolset minimal -v
+```
 
 ### Validate Mode
 
@@ -127,36 +134,10 @@ This is useful if you previously ran the agent and want to re-check the GitHub s
 | `instruction.txt` | System instruction for the agent |
 | `reporting.py` | Human-readable log formatting |
 
-### Environment Variables
-
-| Variable | Used By | Purpose |
-|----------|---------|---------|
-| `GITHUB_PERSONAL_ACCESS_TOKEN` | MCP Server, Evaluator | GitHub API authentication |
-| `GITHUB_PERSONAL_ACCOUNT_NAME` | Evaluator | Template substitution in task assertions |
-| `OPENAI_API_KEY` | FastAgent | OpenAI model access |
-| `ANTHROPIC_API_KEY` | FastAgent | Anthropic model access |
-
 ### MCP Server Configuration
 
 The GitHub MCP server runs in Docker:
 - Image: `ghcr.io/github/github-mcp-server:v0.15.0`
 - Required env var: `GITHUB_PERSONAL_ACCESS_TOKEN`
 
 Only the access token is passed to the Docker container. The account name is used locally by the evaluator for template substitution in task assertions (e.g., checking `{{GITHUB_PERSONAL_ACCOUNT_NAME}}/repo-name` exists).
-
-## Troubleshooting
-
-### "Docker not found"
-Ensure Docker Desktop is running and restart your terminal.
-
-### "GITHUB_PERSONAL_ACCESS_TOKEN environment variable not set"
-Export the required environment variables before running tests.
-
-### "repository doesn't exist" (false negative)
-GitHub's search API has indexing delays for newly created repos. The evaluator patches handle this with direct API calls, but occasional failures may occur.
-
-### Rate limiting
-If you hit GitHub API rate limits, wait a few minutes or use a token with higher limits.
-
-### Tests pass but some checks fail
-Review the `*_readable.log` files in the output directory for detailed execution traces.
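
The environment variable requirements documented in this diff can be checked before a run. The following is a minimal sketch and is not part of the commit: `missing_env_vars` is a hypothetical helper, and inferring the provider from the `gpt-` and `claude-` model-name prefixes is an assumption based on the example model names in the README, not on how the test suite actually resolves models.

```python
import os
from typing import Mapping

# Variables the README lists as required for every run.
GITHUB_VARS = ("GITHUB_PERSONAL_ACCESS_TOKEN", "GITHUB_PERSONAL_ACCOUNT_NAME")

# Assumption: the provider is inferred from the model-name prefix.
PROVIDER_KEYS = {"gpt-": "OPENAI_API_KEY", "claude-": "ANTHROPIC_API_KEY"}


def missing_env_vars(model: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    required = list(GITHUB_VARS)
    for prefix, key in PROVIDER_KEYS.items():
        if model.startswith(prefix):
            required.append(key)
    # Treat an empty string the same as an unset variable.
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars("gpt-5")` without a second argument reads the current process environment. An empty result means the variables named in the README are set; it does not verify that the token is valid or has the needed scopes.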