Add MCP eval harness for multi-model tool-use testing by ryanjoneil · Pull Request #273 · nextmv-io/nextmv-py

ryanjoneil · 2026-04-08T10:38:22Z

Summary

Add eval harness (nextmv/evals/) for testing LLM tool-use against the MCP server
YAML eval cases define tasks, expected tools, success criteria, and deterministic replay scripts
Provider-agnostic Agent protocol with AnthropicAgent and OpenAIAgent (Ollama-compatible) implementations
Tool groups (tool_groups.yaml) scope tools per case, raising the pass rate on smaller models from 0/5 to 5/5
System prompt with tool category guide improves tool discovery across all model sizes
30 deterministic tests (CI-safe), LLM tests marked @pytest.mark.llm (on-demand)
GitHub Actions workflow: deterministic on PR, Claude evals via workflow_dispatch
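
As a rough illustration of the provider-agnostic Agent protocol and the deterministic replay scripts described above, the sketch below shows one way such a protocol could look. It is not the code in this PR: `ToolCall`, `next_calls`, `run_loop`, and the tool names `list_apps` and `get_app` are hypothetical names chosen for the example, and the loop is written synchronously for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


@dataclass
class ToolCall:
    """One tool invocation requested by an agent (hypothetical shape)."""

    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


class Agent(Protocol):
    """Anything that maps a task plus prior tool results to the next tool calls."""

    def next_calls(self, task: str, results: list[str]) -> list[ToolCall]: ...


class DeterministicAgent:
    """Replays a fixed script of tool calls, one step per turn, with no LLM."""

    def __init__(self, script: list[list[ToolCall]]) -> None:
        self._script = script
        self._turn = 0

    def next_calls(self, task: str, results: list[str]) -> list[ToolCall]:
        if self._turn >= len(self._script):
            return []  # script exhausted: the agent is done
        calls = self._script[self._turn]
        self._turn += 1
        return calls


def run_loop(agent: Agent, task: str, call_tool: Callable[[str, dict], str]) -> list[str]:
    """Drive the agent until it stops requesting tools; return the tool names used."""
    used: list[str] = []
    results: list[str] = []
    while calls := agent.next_calls(task, results):
        results = [call_tool(c.name, c.arguments) for c in calls]
        used.extend(c.name for c in calls)
    return used
```

Because `Agent` is a structural protocol, a scripted agent and an LLM-backed agent can share the same loop, which is presumably what lets the deterministic tests run in CI without API keys.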

Eval results (local Ollama):

gpt-oss:20b: 5/5 app management cases
gemma4:26b: 5/5 app management cases (4/5 before tool groups fix)

Test plan

pytest tests/evals/ -m "not llm" — 30 deterministic tests pass
pytest tests/evals/ -m llm — LLM tests run with Ollama or API keys
Existing 551 tests unaffected

- Add `tools` field to eval cases for scoping tools per case
- Add EVAL_SYSTEM_PROMPT with tool category guide
- Add OpenAI and Anthropic provider agents
- Runner.get_tools() filters to case-scoped tools
- Both gpt-oss:20b and gemma4:26b pass 5/5 app management evals

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
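
The case-scoped filtering mentioned in this commit could work roughly as follows. This is a minimal sketch under assumptions: the standalone `scoped_tools` function, the dict-based tool schema with a `name` key, and the example tool names are all hypothetical and are not taken from `Runner.get_tools()` itself.

```python
from typing import Any, Optional


def scoped_tools(
    all_tools: list[dict[str, Any]], allowed: Optional[list[str]]
) -> list[dict[str, Any]]:
    """Return only the tool schemas a case is allowed to see.

    `allowed=None` is treated here as "no scoping declared" (an assumption),
    in which case every tool is exposed to the model.
    """
    if allowed is None:
        return list(all_tools)
    wanted = set(allowed)
    # Preserve the server's original tool ordering.
    return [tool for tool in all_tools if tool["name"] in wanted]
```

Handing a smaller model only the handful of tools relevant to a case, rather than the full catalogue, is the effect the commit message credits for the improved pass rate.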

- Add tool_groups.yaml defining 15 tool categories (app, run, version, instance, scenario, batch, ensemble, etc.)
- Loader resolves tool_groups into flat tool lists with dedup
- Cases use `tool_groups: [app]` instead of listing tools individually
- Both gpt-oss:20b and gemma4:26b pass 5/5 and 4/5 with 7 scoped tools (vs 0/5 with all 89 tools)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
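
Resolving group names into a flat, de-duplicated tool list might look like the sketch below. The groups are passed in as a plain dict standing in for the parsed contents of tool_groups.yaml, and the function name, group contents, and tool names are hypothetical. The `ValueError` on unknown group names mirrors the behavior a later commit in this PR describes for loader.py.

```python
def resolve_tool_groups(group_names: list[str], groups: dict[str, list[str]]) -> list[str]:
    """Flatten the named groups into one tool list, keeping first-seen order."""
    resolved: list[str] = []
    seen: set[str] = set()
    for group in group_names:
        if group not in groups:
            raise ValueError(f"unknown tool group: {group!r}")
        for tool in groups[group]:
            if tool not in seen:  # a tool shared by two groups is listed once
                seen.add(tool)
                resolved.append(tool)
    return resolved
```

With this shape, a case that declares `tool_groups: [app]` ends up with the same flat list it would have had by naming each tool individually.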

- runner.py: wrap call_tool in try/except to handle tool errors
- loader.py: raise ValueError on unknown tool group names
- scorer.py: str() content before substring check
- anthropic.py: add API error handling, document single-use
- Remove TestOpenAIEvals class (keep OpenAI provider for Ollama)
- Simplify CI workflow to deterministic + Claude only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
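
The runner and scorer hardening listed above can be sketched as two small helpers. Both are illustrative only: `safe_call_tool`, `contains`, the result dict shape, and the tool name are hypothetical, and the tool call is written synchronously for brevity.

```python
from typing import Any, Callable


def safe_call_tool(
    call_tool: Callable[[str, dict], Any], name: str, arguments: dict
) -> dict[str, Any]:
    """Call a tool and turn any exception into an error result.

    Returning the error as content lets the eval continue (and lets the
    agent see what went wrong) instead of aborting the whole run.
    """
    try:
        return {"ok": True, "content": call_tool(name, arguments)}
    except Exception as exc:
        return {"ok": False, "content": f"tool error: {exc}"}


def contains(content: Any, needle: str) -> bool:
    """Substring check that tolerates non-string content such as dicts or lists."""
    return needle in str(content)
```

Coercing with `str()` first avoids a `TypeError` when a tool returns structured data rather than text and a success criterion is expressed as a substring.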

ryanjoneil and others added 12 commits April 7, 2026 22:42

chore: add evals dependency group (anthropic, openai SDKs)

81378c4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(evals): add tool schema bridge, Agent protocol, and DeterministicAgent

e5adb3c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(evals): add YAML eval case loader and eval cases

baf4119

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(evals): add eval runner with agent-MCP tool loop

3a04143

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(evals): add LLM eval test file with Ollama, Anthropic, and OpenAI support

484d8e9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: add GitHub Actions workflow for MCP evals

728a34f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(evals): use OLLAMA_HOST env var for configurable Ollama URL

ba4e32c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Move plan doc to agents repo where design docs are managed

f6172b7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ryanjoneil wants to merge 12 commits into refactor/mcp-shared-actions from feature/mcp-evals
