Skip to content

Commit 9ce24e3

Browse files
committed
feat(llm): tool-call assertions and agent-run analysis MCP tools
Add deterministic, read-only analysis of an agent run reconstructed from the LLM requests MockServer recorded — no LLM call, fully reproducible. - AgentRunAnalyzer (mockserver-core, org.mockserver.llm.analysis): decodes recorded requests with the provider codec, treats the richest conversation (most messages = latest snapshot) as canonical, and exposes inspectToolCalls (count assistant tool calls by name + optional args regex) and summarise (message/assistant-turn counts, ordered tool-call sequence, tool-result IDs, latest message role). Pure and offline. - verify_tool_call MCP tool: assert an agent called a named tool atLeast/atMost times, optionally with arguments matching a regex. - explain_agent_run MCP tool: structural summary of a recorded run. Both retrieve recorded requests via /mockserver/retrieve and delegate to AgentRunAnalyzer; validate provider + params (atMost >= atLeast). The dashboard surfacing of this analysis is roadmap item #11 (correlated call-graph view), which builds on AgentRunAnalyzer — so this phase is backend + MCP + docs. Docs: docs/code/llm-mocking.md (Agent-run analysis section + tools + source refs), consumer AI/MCP tools page (two new tool sections), roadmap status, changelog. Tests: 7 AgentRunAnalyzerTest (counts, args-regex filter, run summary, richest snapshot, empty/non-decodable, tool-result correlation) + 6 LlmMcpToolsTest (verify satisfied/unsatisfied, args filter, missing toolName, explain with and without recorded conversation). Core + netty tests green.
1 parent 2ed88ee commit 9ce24e3

8 files changed

Lines changed: 669 additions & 6 deletions

File tree

changelog.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77
## [Unreleased]
88

99
### Added
10+
- Added two MCP tools for **agent-run analysis and tool-call assertions**, both backed by a new deterministic `org.mockserver.llm.analysis.AgentRunAnalyzer` that reconstructs an agent run by decoding the LLM requests MockServer recorded. `verify_tool_call` asserts that an agent called a named tool a given number of times (`atLeast`/`atMost`, with an optional regex over the tool-call arguments); `explain_agent_run` summarises the run's structure (message and assistant-turn counts, the ordered tool-call sequence, tool results, and the latest message role). Read-only and offline — no LLM call. See the AI/MCP tools page and `docs/code/llm-mocking.md`.
1011
- Added a **runtime-LLM client SPI** (`org.mockserver.llm.client`) that lets MockServer call a real LLM you already run, as the foundation for opt-in features such as drift detection and exploratory semantic matching. Mirrors the existing codec registry: an `LlmClient` per provider (Ollama, OpenAI, OpenAI Responses, Azure OpenAI, Anthropic, Gemini, Bedrock) registered in `LlmClientRegistry`, an immutable `LlmBackend` config (with the API key redacted in logs), and a three-layer `LlmBackendResolver` (provider env vars → `mockserver.llmProvider`/`llmApiKey`/`llmModel`/`llmBaseUrl` → named-backends JSON via `mockserver.llmBackendsConfig`). All runtime-LLM use goes through `LlmCompletionService`, which is **off unless a backend is configured**, **fails closed** on any timeout/error/non-2xx (never flipping a deterministic result), and caches per normalised prompt for reproducibility. Ollama is the reference backend (no key, local); Bedrock builds the Anthropic-on-Bedrock request and relies on the `headers` escape hatch pending automatic SigV4 signing. See the configuration properties page and `docs/code/llm-mocking.md`.
1112
- LLM conversation mocks can now opt into deterministic **prompt normalisation** before the `latestMessageContains` / `latestMessageMatches` predicates are evaluated, so a match is not blocked by cosmetic differences in dynamically-assembled agent prompts. A new `normalization` block on `conversationPredicates` (also exposed per-turn in the `create_llm_conversation` MCP tool and the dashboard conversation wizard) supports collapsing whitespace, lowercasing, sorting JSON object keys, dropping built-in volatile values (ISO-8601 timestamps, UUIDs, `req_`/`msg_`/`call_` ids), and dropping named JSON fields. Normalisation is pure and idempotent — it never makes a test flaky — and has no effect unless a text predicate is set. See the AI/MCP tools page and `docs/code/llm-mocking.md`.
1213
- DataFaker (`net.datafaker:datafaker:2.5.4`) is now bundled as a template helper. A single shared `Faker` instance is exposed as `faker` in all three response-template engines (Velocity, Mustache, JavaScript) via `TemplateFunctions.BUILT_IN_HELPERS`, giving templates access to 250+ realistic-fake-data providers (`faker.name().firstName()`, `faker.internet().emailAddress()`, `faker.address().city()`, etc.). The instance is thread-safe and produces fresh random values on each call. See the consumer docs (response templates page) for the full provider list and per-engine syntax. Java 17 unlocked this — DataFaker 2.x requires Java 17; the previous Java 11 floor pinned us to the abandoned 1.9.0 line.

docs/code/llm-mocking.md

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -162,8 +162,19 @@ Two MCP tools expose the LLM mocking feature to agents:
162162
|------|-------------|
163163
| `mock_llm_completion` | Creates a single LLM expectation from provider, path, text, tool calls, usage |
164164
| `create_llm_conversation` | Creates a multi-turn conversation with scenario state chain, optional isolation, and an optional per-turn `match.normalization` object |
165+
| `verify_tool_call` | Asserts an agent called a named tool `atLeast`/`atMost` times (optional args regex), by decoding recorded LLM requests |
166+
| `explain_agent_run` | Summarises a recorded agent run: turn/tool-call sequence, tool results, latest role |
165167

166-
Both validate provider availability against `ProviderCodecRegistry` at registration time.
168+
The first two validate provider availability against `ProviderCodecRegistry` at registration time. The analysis tools delegate to `org.mockserver.llm.analysis.AgentRunAnalyzer`.
169+
170+
## Agent-run analysis
171+
172+
`AgentRunAnalyzer` (`org.mockserver.llm.analysis`) is a deterministic, read-only inspector. Given the LLM requests MockServer recorded (retrieved via the normal request log), it decodes each with the provider's `ProviderCodec` and treats the **richest** conversation (most messages — the latest dialogue snapshot) as the canonical run. From that it derives:
173+
174+
- `inspectToolCalls(requests, provider, toolName, argsRegex)` → count + matched tool calls (powers `verify_tool_call`).
175+
- `summarise(requests, provider)` → message count, assistant-turn count, ordered tool-call name sequence, tool-result keys, latest message role (powers `explain_agent_run`).
176+
177+
No LLM is called and no network is used — it reads the structure the codecs already produce, so assertions are reproducible. The MCP tools are thin wrappers that retrieve recorded requests (`/mockserver/retrieve?type=REQUESTS`) and format the analyzer's output. The dashboard surfacing of this analysis is the correlated call-graph view (roadmap item #11).
167178

168179
## Dashboard Rendering
169180

@@ -334,3 +345,4 @@ Key source files under `mockserver/mockserver-core/src/main/java/org/mockserver/
334345
| `llm/client/LlmBackendResolver.java` | Three-layer backend resolution (env / properties / named JSON) |
335346
| `llm/client/LlmCompletionService.java` | Orchestrator: off-unless-configured, fail-closed, cached |
336347
| `llm/client/LlmTransport.java` + `NettyHttpClientLlmTransport.java` | Transport seam over `NettyHttpClient` |
348+
| `llm/analysis/AgentRunAnalyzer.java` | Deterministic read-only agent-run inspection (tool-call counts, run summary) |

docs/plans/mockserver-llm-mocking.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,8 @@ The original RFC (RFC-1 LLM Response Builder + RFC-2 Stateful Scripted Conversat
1515
|---|---|---|
1616
| 1 | LLM response builder (`llmMock`) — RFC-1 | ✅ Shipped (M0–M5) |
1717
| 2 | Stateful scripted conversations — RFC-2 Layer B | ✅ Shipped (M2) |
18-
| 3 | Tool-call assertions (`verify_tool_call`) | ❌ Not started |
19-
| 4 | Agent-run / LLM-session analysis (`explain_agent_run`) | ❌ Not started |
18+
| 3 | Tool-call assertions (`verify_tool_call`) | ✅ Shipped — `verify_tool_call` MCP tool over `AgentRunAnalyzer` (decodes recorded requests; asserts a named tool was called atLeast/atMost times, optional args regex) |
19+
| 4 | Agent-run / LLM-session analysis (`explain_agent_run`) | ✅ Shipped — `explain_agent_run` MCP tool (turn/tool-call sequence, tool results, latest role). UI surfacing is item #11 (call-graph view) |
2020

2121
### Tier 2 — high value
2222

jekyll-www.mock-server.com/mock_server/ai_mcp_tools.html

Lines changed: 70 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
11
---
22
title: MCP Tools Reference
3+
description: Full reference for every MCP tool exposed by MockServer, including create_expectation, verify_request, retrieve, debug, OpenAPI, and LLM mocking tools.
34
shortTitle: MCP Tools Reference
45
layout: page
56
pageOrder: 2
@@ -45,6 +46,8 @@ <h2>Tool Overview</h2>
4546
<tr><td><a href="#load_expectations_from_file"><code>load_expectations_from_file</code></a></td><td>Load expectations from a fixture file for replay</td><td>High</td></tr>
4647
<tr><td><a href="#mock_llm_completion"><code>mock_llm_completion</code></a></td><td>Create a single-turn LLM completion expectation for any supported provider</td><td>High</td></tr>
4748
<tr><td><a href="#create_llm_conversation"><code>create_llm_conversation</code></a></td><td>Create a multi-turn scripted LLM conversation with optional per-session isolation</td><td>High</td></tr>
49+
<tr><td><a href="#verify_tool_call"><code>verify_tool_call</code></a></td><td>Assert an agent called a named tool, from recorded LLM requests</td><td>High</td></tr>
50+
<tr><td><a href="#explain_agent_run"><code>explain_agent_run</code></a></td><td>Summarise a recorded agent run (turns, tool-call sequence)</td><td>High</td></tr>
4851
<tr><td><a href="#raw_expectation"><code>raw_expectation</code></a></td><td>Full expectation JSON passthrough</td><td>Low</td></tr>
4952
<tr><td><a href="#raw_retrieve"><code>raw_retrieve</code></a></td><td>Full retrieve with correlation ID filtering</td><td>Low</td></tr>
5053
<tr><td><a href="#raw_verify"><code>raw_verify</code></a></td><td>Full verification JSON</td><td>Low</td></tr>
@@ -536,7 +539,7 @@ <h3>stop_server</h3>
536539

537540
<h3>create_expectation_from_openapi</h3>
538541

539-
<p>Generate mock expectations from an <a target="_blank" href="https://swagger.io/docs/specification/basic-structure/">OpenAPI v3</a> specification. MockServer will create one expectation per operation in the specification, using example responses where available.</p>
542+
<p>Generate mock expectations from an <a target="_blank" href="https://swagger.io/docs/specification/basic-structure/" rel="noopener noreferrer">OpenAPI v3</a> specification. MockServer will create one expectation per operation in the specification, using example responses where available.</p>
540543

541544
<table>
542545
<thead>
@@ -1093,6 +1096,70 @@ <h3>create_llm_conversation</h3>
10931096

10941097
<p>The <code>scenarioName</code> in the response is auto-generated and encodes the isolation key. The <code>states</code> array shows the scenario state progression: <code>Started</code> &rarr; <code>turn_1</code> &rarr; <code>__done</code>. Each concurrent session identified by a distinct <code>x-session-id</code> header value advances through its own copy of this state chain.</p>
10951098

1099+
<a id="verify_tool_call" class="anchor" href="#verify_tool_call">&nbsp;</a>
1100+
1101+
<h3>verify_tool_call</h3>
1102+
1103+
<p>Assert that an agent called a particular tool, by decoding the LLM requests MockServer recorded and inspecting the assistant tool calls in the conversation. Deterministic and read-only &mdash; it does not call any LLM. Useful for testing that your agent decided to use the expected tool (and, optionally, with the expected arguments).</p>
1104+
1105+
<table>
1106+
<thead>
1107+
<tr><th>Parameter</th><th>Type</th><th>Required</th><th>Description</th></tr>
1108+
</thead>
1109+
<tbody>
1110+
<tr><td><code>provider</code></td><td>string</td><td>Yes</td><td>LLM provider whose recorded requests to inspect (e.g. <code>ANTHROPIC</code>, <code>OPENAI</code>)</td></tr>
1111+
<tr><td><code>toolName</code></td><td>string</td><td>Yes</td><td>Name of the tool the agent should have called</td></tr>
1112+
<tr><td><code>path</code></td><td>string</td><td>No</td><td>Restrict to requests on this path (e.g. <code>/v1/messages</code>)</td></tr>
1113+
<tr><td><code>argumentsRegex</code></td><td>string</td><td>No</td><td>Java regex matched against the tool call's argument JSON</td></tr>
1114+
<tr><td><code>atLeast</code></td><td>integer</td><td>No</td><td>Minimum matching calls required (default 1)</td></tr>
1115+
<tr><td><code>atMost</code></td><td>integer</td><td>No</td><td>Maximum matching calls allowed</td></tr>
1116+
</tbody>
1117+
</table>
1118+
1119+
<p>The result reports <code>count</code> (matching tool calls found) and <code>satisfied</code> (whether the count met the <code>atLeast</code>/<code>atMost</code> constraints); when not satisfied it includes a human-readable <code>message</code>.</p>
1120+
1121+
<pre class="prettyprint code"><code class="code">{
1122+
"jsonrpc": "2.0",
1123+
"id": 40,
1124+
"method": "tools/call",
1125+
"params": {
1126+
"name": "verify_tool_call",
1127+
"arguments": {
1128+
"provider": "ANTHROPIC",
1129+
"path": "/v1/messages",
1130+
"toolName": "get_weather",
1131+
"argumentsRegex": "Paris",
1132+
"atLeast": 1
1133+
}
1134+
}
1135+
}</code></pre>
1136+
1137+
<a id="explain_agent_run" class="anchor" href="#explain_agent_run">&nbsp;</a>
1138+
1139+
<h3>explain_agent_run</h3>
1140+
1141+
<p>Summarise an agent run reconstructed from recorded LLM requests &mdash; a quick way to see what an agent did without reading raw request bodies. Returns the message count, the number of assistant turns, the ordered sequence of tool-call names (<code>toolCallSequence</code>), the tool-use IDs a result was returned for (<code>toolResultsFor</code>, e.g. <code>"toolu_1"</code>), and the role of the latest message. Deterministic and read-only.</p>
1142+
1143+
<table>
1144+
<thead>
1145+
<tr><th>Parameter</th><th>Type</th><th>Required</th><th>Description</th></tr>
1146+
</thead>
1147+
<tbody>
1148+
<tr><td><code>provider</code></td><td>string</td><td>Yes</td><td>LLM provider whose recorded requests to summarise</td></tr>
1149+
<tr><td><code>path</code></td><td>string</td><td>No</td><td>Restrict to requests on this path</td></tr>
1150+
</tbody>
1151+
</table>
1152+
1153+
<pre class="prettyprint code"><code class="code">{
1154+
"jsonrpc": "2.0",
1155+
"id": 41,
1156+
"method": "tools/call",
1157+
"params": {
1158+
"name": "explain_agent_run",
1159+
"arguments": { "provider": "ANTHROPIC", "path": "/v1/messages" }
1160+
}
1161+
}</code></pre>
1162+
10961163
<a id="low_level_tools" class="anchor" href="#low_level_tools">&nbsp;</a>
10971164

10981165
<h2>Low-Level Tools</h2>
@@ -1110,7 +1177,7 @@ <h3>raw_expectation</h3>
11101177
<tr><th>Parameter</th><th>Type</th><th>Required</th><th>Description</th></tr>
11111178
</thead>
11121179
<tbody>
1113-
<tr><td><code>expectation</code></td><td>object</td><td>Yes</td><td>Full expectation JSON as defined in the <a target="_blank" href="https://app.swaggerhub.com/apis/jamesdbloom/mock-server-openapi">REST API specification</a></td></tr>
1180+
<tr><td><code>expectation</code></td><td>object</td><td>Yes</td><td>Full expectation JSON as defined in the <a target="_blank" href="https://app.swaggerhub.com/apis/jamesdbloom/mock-server-openapi" rel="noopener noreferrer">REST API specification</a></td></tr>
11141181
</tbody>
11151182
</table>
11161183

@@ -1196,7 +1263,7 @@ <h3>raw_verify</h3>
11961263
<tr><th>Parameter</th><th>Type</th><th>Required</th><th>Description</th></tr>
11971264
</thead>
11981265
<tbody>
1199-
<tr><td><code>verification</code></td><td>object</td><td>Yes</td><td>Full verification JSON as defined in the <a target="_blank" href="https://app.swaggerhub.com/apis/jamesdbloom/mock-server-openapi">REST API specification</a></td></tr>
1266+
<tr><td><code>verification</code></td><td>object</td><td>Yes</td><td>Full verification JSON as defined in the <a target="_blank" href="https://app.swaggerhub.com/apis/jamesdbloom/mock-server-openapi" rel="noopener noreferrer">REST API specification</a></td></tr>
12001267
</tbody>
12011268
</table>
12021269

0 commit comments

Comments
 (0)