Add MCP eval harness for multi-model tool-use testing #273
Draft
ryanjoneil wants to merge 12 commits into
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cAgent Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the auto-generated scorer with the planned version:
- Order-independent tool matching (not ordered subsequence)
- "contains" key for substring matching (matches YAML case format)
- 11 dedicated scorer tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
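The two scoring rules named in this commit can be sketched as follows. This is a minimal illustration, not the PR's actual `scorer.py`: the `ToolCall` shape, the `score_case` name, and the `expected_tools` key are assumptions made for the example; only the `contains` key and the order-independent matching come from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation recorded during an eval run (hypothetical shape)."""

    name: str
    arguments: dict = field(default_factory=dict)


def score_case(expected: dict, calls: list[ToolCall], final_answer: object) -> bool:
    """Pass when every expected tool was called, in any order, and every
    `contains` substring appears in the final answer."""
    called = {call.name for call in calls}
    # Order-independent: subset check on names, not an ordered subsequence.
    # `expected_tools` is an assumed key name for this sketch.
    tools_ok = set(expected.get("expected_tools", [])) <= called
    # Coerce to str so non-string content can still be searched.
    text = str(final_answer)
    contains_ok = all(needle in text for needle in expected.get("contains", []))
    return tools_ok and contains_ok
```

A subset check keeps a case from failing just because a model calls the right tools in a different order or makes an extra exploratory call.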
- Add `tools` field to eval cases for scoping tools per case
- Add EVAL_SYSTEM_PROMPT with tool category guide
- Add OpenAI and Anthropic provider agents
- Runner.get_tools() filters to case-scoped tools
- Both gpt-oss:20b and gemma4:26b pass 5/5 app management evals
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
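Filtering the server's tool list down to a case's `tools` field might look like the sketch below. The function name `filter_tools`, the dict-based tool descriptors, and the fallback to all tools when a case declares no scope are assumptions for illustration; the PR's `Runner.get_tools()` may differ.

```python
from __future__ import annotations


def filter_tools(all_tools: list[dict], allowed: list[str] | None) -> list[dict]:
    """Keep only the tool descriptors whose name is in `allowed`.

    Assumption for this sketch: a case with no `tools` scope sees every tool.
    """
    if not allowed:
        return list(all_tools)
    allowed_names = set(allowed)
    # Preserve the server's original ordering of tools.
    return [tool for tool in all_tools if tool["name"] in allowed_names]
```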
- Add tool_groups.yaml defining 15 tool categories (app, run, version, instance, scenario, batch, ensemble, etc.)
- Loader resolves tool_groups into flat tool lists with dedup
- Cases use `tool_groups: [app]` instead of listing tools individually
- Both gpt-oss:20b and gemma4:26b pass 5/5 and 4/5 with 7 scoped tools (vs 0/5 with all 89 tools)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
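Resolving `tool_groups` into a flat, de-duplicated tool list could be sketched as below. The group contents and tool names are hypothetical, and a plain dict stands in for the parsed `tool_groups.yaml`. Raising `ValueError` on an unknown group mirrors the fix described in a later commit of this PR.

```python
from __future__ import annotations


def resolve_tools(case: dict, tool_groups: dict[str, list[str]]) -> list[str]:
    """Expand a case's `tool_groups` (plus any explicit `tools`) into a flat
    list of tool names, keeping the first occurrence of each name."""
    resolved: list[str] = []
    seen: set[str] = set()

    def add(name: str) -> None:
        if name not in seen:
            seen.add(name)
            resolved.append(name)

    for group in case.get("tool_groups", []):
        if group not in tool_groups:
            raise ValueError(f"unknown tool group: {group!r}")
        for name in tool_groups[group]:
            add(name)
    # Individually listed tools are merged after the groups.
    for name in case.get("tools", []):
        add(name)
    return resolved


# Hypothetical group definitions standing in for tool_groups.yaml.
GROUPS = {"app": ["list_apps", "get_app"], "run": ["get_app", "new_run"]}
print(resolve_tools({"tool_groups": ["app", "run"]}, GROUPS))
```

Failing fast on a misspelled group name avoids silently running a case with an empty or partial tool set.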
…I support Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- runner.py: wrap call_tool in try/except to handle tool errors
- loader.py: raise ValueError on unknown tool group names
- scorer.py: str() content before substring check
- anthropic.py: add API error handling, document single-use
- Remove TestOpenAIEvals class (keep OpenAI provider for Ollama)
- Simplify CI workflow to deterministic + Claude only
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Eval harness (`nextmv/evals/`) for testing LLM tool-use against the MCP server
- `Agent` protocol with `AnthropicAgent` and `OpenAIAgent` (Ollama-compatible) implementations
- Tool groups (`tool_groups.yaml`) scope tools per case, which raised the pass rate on smaller models from 0/5 to 5/5
- LLM-backed tests are marked `@pytest.mark.llm` (on-demand)
- CI workflow runs via `workflow_dispatch`

Eval results (local Ollama):
- `gpt-oss:20b`: 5/5 app management cases
- `gemma4:26b`: 5/5 app management cases (4/5 before tool groups fix)

Test plan
- `pytest tests/evals/ -m "not llm"`: 30 deterministic tests pass
- `pytest tests/evals/ -m llm`: LLM tests run with Ollama or API keys

Related
🤖 Generated with Claude Code