Add MCP eval harness for multi-model tool-use testing #273

Draft
ryanjoneil wants to merge 12 commits into refactor/mcp-shared-actions from feature/mcp-evals
Conversation

@ryanjoneil
Member

Summary

  • Add eval harness (nextmv/evals/) for testing LLM tool-use against the MCP server
  • YAML eval cases define tasks, expected tools, success criteria, and deterministic replay scripts
  • Provider-agnostic Agent protocol with AnthropicAgent and OpenAIAgent (Ollama-compatible) implementations
  • Tool groups (tool_groups.yaml) scope tools per case; this raised the pass rate on smaller models from 0/5 to 5/5
  • System prompt with tool category guide improves tool discovery across all model sizes
  • 30 deterministic tests (CI-safe), LLM tests marked @pytest.mark.llm (on-demand)
  • GitHub Actions workflow: deterministic on PR, Claude evals via workflow_dispatch
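
As a rough illustration of how a provider-agnostic Agent protocol and a deterministic replay script can fit together, here is a minimal sketch. It is not the code in nextmv/evals/: the `ToolCall` dataclass, the `run` signature, the `ReplayAgent` class, and the tool names `list_apps` and `get_app` are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class ToolCall:
    """One tool invocation requested by an agent (hypothetical shape)."""

    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


class Agent(Protocol):
    """Anything that turns a task plus tool definitions into tool calls."""

    def run(self, task: str, tools: list[dict[str, Any]]) -> list[ToolCall]: ...


class ReplayAgent:
    """Deterministic stand-in that replays a scripted list of calls, no LLM needed."""

    def __init__(self, script: list[ToolCall]) -> None:
        self._script = list(script)

    def run(self, task: str, tools: list[dict[str, Any]]) -> list[ToolCall]:
        allowed = {tool["name"] for tool in tools}
        # Only replay calls to tools that were actually offered for this case.
        return [call for call in self._script if call.name in allowed]


def collect_calls(agent: Agent, task: str, tools: list[dict[str, Any]]) -> list[str]:
    """Run any Agent implementation and return the tool names it called."""
    return [call.name for call in agent.run(task, tools)]


# Hypothetical tool names, used only for illustration.
tools = [{"name": "list_apps"}, {"name": "get_app"}]
script = [ToolCall("list_apps"), ToolCall("get_app", {"app_id": "demo"})]
called = collect_calls(ReplayAgent(script), "Show me the demo app", tools)
```

Because `Agent` is a structural protocol, an Anthropic-backed or OpenAI-backed class would only need a matching `run` method to be scored by the same code path as the replay agent.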

Eval results (local Ollama):

  • gpt-oss:20b: 5/5 app management cases
  • gemma4:26b: 5/5 app management cases (4/5 before tool groups fix)

Test plan

  • pytest tests/evals/ -m "not llm" — 30 deterministic tests pass
  • pytest tests/evals/ -m llm — LLM tests run with Ollama or API keys
  • Existing 551 tests unaffected

Related

🤖 Generated with Claude Code

ryanjoneil and others added 12 commits April 7, 2026 22:42
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cAgent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the auto-generated scorer with the planned version:
- Order-independent tool matching (not ordered subsequence)
- "contains" key for substring matching (matches YAML case format)
- 11 dedicated scorer tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
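
The two scorer behaviours named in this commit, order-independent tool matching and the "contains" substring check, could look roughly like the sketch below. The function names and the criteria dictionary shape are assumptions for illustration, and set-based matching (each expected tool called at least once) is one plausible reading of "order-independent", not necessarily what scorer.py does. The `str()` coercion mirrors the fix listed for scorer.py in a later commit.

```python
from typing import Any


def score_tools(expected: list[str], called: list[str]) -> bool:
    """Pass when every expected tool was called at least once, in any order."""
    return set(expected) <= set(called)


def score_contains(criteria: dict[str, Any], content: Any) -> bool:
    """Pass when str(content) includes the 'contains' substring, if one is given."""
    needle = criteria.get("contains")
    if needle is None:
        return True
    # Coerce first: tool results may be dicts or lists rather than strings.
    return str(needle) in str(content)


# Hypothetical tool names and payloads, used only for illustration.
order_ok = score_tools(["get_app", "list_apps"], ["list_apps", "get_app", "list_apps"])
missing = score_tools(["get_app", "delete_app"], ["get_app"])
substring_ok = score_contains({"contains": "demo"}, {"id": "demo"})
```

An ordered-subsequence scorer would reject the first call sequence if the expected list were written in a different order, which is the case the set comparison is meant to tolerate.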
- Add `tools` field to eval cases for scoping tools per case
- Add EVAL_SYSTEM_PROMPT with tool category guide
- Add OpenAI and Anthropic provider agents
- Runner.get_tools() filters to case-scoped tools
- Both gpt-oss:20b and gemma4:26b pass 5/5 app management evals

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
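
A minimal sketch of what filtering to case-scoped tools might involve is shown below. It is not the actual Runner.get_tools(): the `filter_tools` name, the dictionary shape of a tool definition, and the choice to return every tool when a case declares no scope are assumptions for this example.

```python
from typing import Any, Optional


def filter_tools(
    all_tools: list[dict[str, Any]], scoped: Optional[list[str]]
) -> list[dict[str, Any]]:
    """Keep only the tool definitions a case asks for, preserving server order."""
    if not scoped:
        # Assumption: a case without a `tools` field sees the full tool list.
        return list(all_tools)
    wanted = set(scoped)
    return [tool for tool in all_tools if tool["name"] in wanted]


# Hypothetical tool names, used only for illustration.
server_tools = [{"name": "list_apps"}, {"name": "get_app"}, {"name": "list_runs"}]
scoped_names = [tool["name"] for tool in filter_tools(server_tools, ["get_app", "list_apps"])]
```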
- Add tool_groups.yaml defining 15 tool categories (app, run,
  version, instance, scenario, batch, ensemble, etc.)
- Loader resolves tool_groups into flat tool lists with dedup
- Cases use `tool_groups: [app]` instead of listing tools individually
- gpt-oss:20b and gemma4:26b pass 5/5 and 4/5, respectively, with 7 scoped tools
  (vs 0/5 with all 89 tools)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
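
Resolving `tool_groups` into a flat, de-duplicated tool list can be sketched as follows. The group contents and the `resolve_tool_groups` name are hypothetical, and a plain dictionary stands in for the parsed tool_groups.yaml. The `ValueError` on unknown group names follows the loader.py change listed in a later commit.

```python
def resolve_tool_groups(
    group_names: list[str], groups: dict[str, list[str]]
) -> list[str]:
    """Flatten the named groups into one tool list, keeping first-seen order."""
    resolved: list[str] = []
    seen: set[str] = set()
    for group in group_names:
        if group not in groups:
            raise ValueError(f"unknown tool group: {group!r}")
        for tool in groups[group]:
            # A tool shared by two groups is only listed once.
            if tool not in seen:
                seen.add(tool)
                resolved.append(tool)
    return resolved


# Hypothetical groups standing in for the parsed contents of tool_groups.yaml.
TOOL_GROUPS = {
    "app": ["list_apps", "get_app", "create_app"],
    "run": ["list_runs", "get_app"],
}
flat = resolve_tool_groups(["app", "run"], TOOL_GROUPS)
```

Failing loudly on an unknown group means a typo in a case file surfaces at load time instead of silently leaving the model with too few tools.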
…I support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- runner.py: wrap call_tool in try/except to handle tool errors
- loader.py: raise ValueError on unknown tool group names
- scorer.py: str() content before substring check
- anthropic.py: add API error handling, document single-use
- Remove TestOpenAIEvals class (keep OpenAI provider for Ollama)
- Simplify CI workflow to deterministic + Claude only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
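
The runner.py change above, wrapping call_tool in try/except, can be sketched like this. The `safe_call_tool` name, the result dictionary shape, and the synchronous `call_tool` callable are assumptions for illustration; the real runner may well be async and return a different structure.

```python
from typing import Any, Callable


def safe_call_tool(
    call_tool: Callable[[str, dict[str, Any]], Any],
    name: str,
    arguments: dict[str, Any],
) -> dict[str, Any]:
    """Invoke a tool and turn any exception into an error result."""
    try:
        return {"tool": name, "ok": True, "content": call_tool(name, arguments)}
    except Exception as exc:
        # Report the failure as content so the eval can continue and be scored.
        return {"tool": name, "ok": False, "content": f"{type(exc).__name__}: {exc}"}


def fake_call_tool(name: str, arguments: dict[str, Any]) -> Any:
    """Hypothetical tool backend used only for this example."""
    if name == "get_app":
        return {"id": arguments["app_id"]}
    raise RuntimeError("app not found")


good = safe_call_tool(fake_call_tool, "get_app", {"app_id": "demo"})
bad = safe_call_tool(fake_call_tool, "delete_app", {})
```

Returning the error as content keeps a single failing tool from aborting the whole case, so the scorer still sees a complete transcript.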