Toolcall Evaluation #45

clojj · 2026-02-19T19:30:02Z

clojj
Feb 19, 2026

Hi,

we think that for tool-calling agents, it would be good to evaluate/assert tool calls (names and possibly inputs).

This of course depends on the existence of toolcall capturing in the app-under-test and some interface to capture them in Dokimos.

fkapsahili · 2026-02-21T17:12:45Z

fkapsahili
Feb 21, 2026
Maintainer

Hi @clojj, thanks for opening the discussion! I also think this would be very valuable and will be highly relevant as agentic applications mature 🙂.

This is already on the roadmap (you can also see "agent tool use" listed under planned evaluators in the README) 😄. The core idea would be that users capture tool calls from their agent under test (tool names, inputs, and optionally outputs) and provide them as part of the EvalTestCase, for example via the metadata map or a new dedicated field. Dokimos would then offer evaluators that let you assert on those tool calls, such as verifying that specific tools were invoked, that the correct sequence of tools was followed, or that tool inputs match expected values.

I think there are a few interesting evaluation dimensions here, for example tool call validity (was the tool called in the right way) and tool correctness (did the agent select the right tool for the given task). These could be separate evaluators that each focus on one aspect of tool use quality.
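To make the assertions mentioned above a bit more concrete (specific tools invoked, correct sequence, inputs matching expected values), here is a minimal sketch in plain Java. It does not use any Dokimos API: `CapturedCall` and `ToolCallChecks` are hypothetical names chosen for illustration, and the eventual evaluators may look quite different.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a captured tool call; this is not a Dokimos type.
record CapturedCall(String name, Map<String, Object> inputs) {}

final class ToolCallChecks {
    private ToolCallChecks() {}

    /** True if every expected tool name was invoked at least once. */
    static boolean invokedAll(List<CapturedCall> calls, List<String> expectedNames) {
        List<String> actual = calls.stream().map(CapturedCall::name).toList();
        return actual.containsAll(expectedNames);
    }

    /** True if the expected names occur in this order (other calls may be interleaved). */
    static boolean invokedInOrder(List<CapturedCall> calls, List<String> expectedOrder) {
        int next = 0;
        for (CapturedCall call : calls) {
            if (next < expectedOrder.size() && call.name().equals(expectedOrder.get(next))) {
                next++;
            }
        }
        return next == expectedOrder.size();
    }

    /** True if some call to the named tool carries all of the expected input entries. */
    static boolean invokedWith(List<CapturedCall> calls, String name, Map<String, Object> expected) {
        return calls.stream()
            .filter(c -> c.name().equals(name))
            .anyMatch(c -> c.inputs().entrySet().containsAll(expected.entrySet()));
    }
}
```

Each check returns a boolean so it can be wrapped in whatever assertion library the test suite already uses.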

I'll post an update here once I've started working on this. If you have specific patterns or assertions in mind that would be useful for your use case, feel free to share them here!

8 replies

fkapsahili Feb 28, 2026
Maintainer

@clojj! Sure! :)

So in general, tool results are captured in the AgentTrace but the evaluators mostly focus on what the agent decided to call rather than what came back. Here's how it breaks down:

ToolCallValidityEvaluator: This checks that tool calls match their JSON schema (names, required params, types) and doesn't care about tool call results.
ToolCorrectnessEvaluator: This checks whether the agent picked the right set of tools and doesn't look at the tool results either.
TaskCompletionEvaluator: This uses a judge LLM to assess whether the user's tasks were completed. This one does see the full conversation including tool results, since they inform whether the task was actually done or not.
ToolArgumentHallucinationEvaluator: This checks that argument values are grounded in the user input and not made up by the LLM and also doesn't look at the tool results.
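To illustrate what a validity check of this kind looks at (tool name, required params, types), here is a rough sketch in plain Java. It is not how ToolCallValidityEvaluator is implemented: `ToolSpec` and `ValiditySketch` are hypothetical, and a flat map of parameter name to Java type stands in for a real JSON schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified tool definition: required parameter name -> expected Java type.
record ToolSpec(String name, Map<String, Class<?>> requiredParams) {}

final class ValiditySketch {
    private ValiditySketch() {}

    /** Returns a list of problems; an empty list means the call looks valid. */
    static List<String> validate(String toolName, Map<String, Object> args, List<ToolSpec> specs) {
        List<String> problems = new ArrayList<>();
        ToolSpec spec = specs.stream()
            .filter(s -> s.name().equals(toolName))
            .findFirst()
            .orElse(null);
        if (spec == null) {
            problems.add("unknown tool: " + toolName);
            return problems;
        }
        for (var param : spec.requiredParams().entrySet()) {
            Object value = args.get(param.getKey());
            if (value == null) {
                problems.add("missing required parameter: " + param.getKey());
            } else if (!param.getValue().isInstance(value)) {
                problems.add("wrong type for parameter: " + param.getKey());
            }
        }
        return problems;
    }
}
```

Note that nothing here inspects a tool result, which matches the description above: validity is about the shape of the call only.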

On mocking tool calls: I think that's exactly the right approach and that tool execution is your app's responsibility -> the evaluators care about the agent's decisions (which tools, which arguments are used at execution-time), and not the backend. In our own integration tests we use canned responses too. And I think for mocking tool calls in your use case you just want to feed results back to the model so it can continue the conversation naturally.

The data model is: capture tool calls into ToolCall objects (name + arguments + result), wrap them in an AgentTrace, and pass to evaluators via EvalTestCase. See the agent evaluation docs -> there's an OpenAI integration section showing the full loop, and a fully working example.

Does that make sense?

And thanks for considering the project for your team's use case! Would love to see the project being used by more people in the JVM community 😄 .

clojj Feb 28, 2026
Author

And I think for mocking tool calls in your use case you just want to feed results back to the model so it can continue the conversation naturally

Did you mean feeding mocked results back?

fkapsahili Feb 28, 2026
Maintainer

@clojj Yes exactly! You would mock the tool execution (return canned/fake results) but still feed those results back to the model as tool call responses so it can continue the conversation loop naturally. The model doesn't know or care that the results are mocked. What matters for evaluation is that the agent chose the right tools with the right arguments, not what those tools actually returned from your backend.
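As a rough sketch of that idea, assuming your agent loop lets you plug in your own tool executor: the class below returns canned results and records each call so it can later be mapped to a trace. `CannedToolExecutor` and `RecordedCall` are hypothetical names, not part of Dokimos.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical record of one executed call; map it to Dokimos' ToolCall in your own glue code.
record RecordedCall(String name, Map<String, Object> arguments, String result) {}

final class CannedToolExecutor {
    private final Map<String, String> cannedResults;
    private final List<RecordedCall> recorded = new ArrayList<>();

    CannedToolExecutor(Map<String, String> cannedResults) {
        this.cannedResults = Map.copyOf(cannedResults);
    }

    /** Returns the canned result for the tool and records what the agent asked for. */
    String execute(String toolName, Map<String, Object> arguments) {
        String result = cannedResults.get(toolName);
        if (result == null) {
            throw new IllegalArgumentException("no canned result for tool: " + toolName);
        }
        recorded.add(new RecordedCall(toolName, Map.copyOf(arguments), result));
        return result;
    }

    /** Snapshot of every call made so far, in invocation order. */
    List<RecordedCall> recordedCalls() {
        return List.copyOf(recorded);
    }
}
```

The string returned by `execute` is what you would send back to the model as the tool call response; the recorded calls are what the evaluators would later see.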

clojj Feb 28, 2026
Author

Ok, but keeping mock results in sync with reality is a challenge too.
Could you give us a pointer (example, repo link) on how to set up mocked tools & results with Dokimos?

Also, can Dokimos tests be run as @SpringBootTest?

Nice that Tool Evaluation landed!

fkapsahili Feb 28, 2026
Maintainer

Keeping mocks in sync is a real challenge and not something Dokimos solves for you. I think that's fundamentally an application concern. What I'd suggest: keep your canned tool responses close to your test cases (same file or test class) so they're easy to update when your tool contracts change. If your tools have OpenAPI specs, you could generate mock responses from those.

On the setup with mocked tools: the key thing to understand is that Dokimos doesn't know about your agent framework. Your agent returns whatever it returns (an OpenAI ChatCompletion, Response, a LangChain4j AiMessage, your own domain type), and you write a few lines of mapping code to build an AgentTrace from it. Here's what that could look like:

@Test
void agentShouldPickCorrectTools() {
    // 1. Run your actual agent (tool backends can be mocked, but the
    //    agent's decision-making runs for real against an LLM)
    MyAgentResponse response = myAgent.handle("Fly JFK to Paris, book hotel for 3 nights");

    // 2. Map your app's types to Dokimos (this is the glue you write once)
    var traceBuilder = AgentTrace.builder().finalResponse(response.getText());
    for (var call : response.getToolCalls()) {
        traceBuilder.addToolCall(ToolCall.builder()
            .name(call.getFunctionName())
            .arguments(call.getArgs())
            .result(call.getResult())
            .build());
    }
    AgentTrace trace = traceBuilder.build();

    // 3. Evaluate: did the agent call the right tools with valid arguments?
    var testCase = EvalTestCase.builder()
        .input("Fly JFK to Paris, book hotel for 3 nights")
        .actualOutput("toolCalls", trace.toolCalls())
        .expectedOutput("toolCalls", List.of(
            ToolCall.of("search_flights", Map.of()),
            ToolCall.of("book_hotel", Map.of())))
        .metadata("tools", toolDefinitions)
        .build();

    assertThat(ToolCallValidityEvaluator.builder().build().evaluate(testCase).success()).isTrue();
    assertThat(ToolCorrectnessEvaluator.builder().build().evaluate(testCase).score()).isEqualTo(1.0);
}

Step 2 is framework-specific: we have a full OpenAI example that shows this bridge for the OpenAI Java SDK, and an offline example with hardcoded traces for quick local testing. The agent evaluation docs cover all six evaluators and the data model.

On @SpringBootTest: yes, evaluators are plain Java objects, so they will work in any test context. The dokimos-junit module also gives you @DatasetSource for parameterized tests driven by a dataset file:

@SpringBootTest
class AgentEvaluationTest {

    @Autowired
    MyAgent agent;

    @ParameterizedTest
    @DatasetSource("classpath:datasets/agent-scenarios.json")
    void shouldHandleScenario(Example example) {
        var response = agent.handle(example.input());
        AgentTrace trace = mapToTrace(response); // your mapping code
        var testCase = example.toTestCase(trace.toOutputMap());
        // `evaluators` holds the evaluators to run; define it elsewhere in this test class
        Assertions.assertEval(testCase, evaluators);
    }
}

If you end up building this with your Spring setup and have a pattern that works well, a contribution to dokimos-examples would be great! A Spring Boot agent evaluation example is something that's missing today and I think other teams would find it useful.

jettro · 2026-03-17T11:14:05Z

jettro
Mar 17, 2026

I am trying this approach as well. I have an agent with two tools. One to search for a whisky using the name and another one to use the result from the first tool to get a url to a detail page and ask for more information. The issue is with the testCase. The url and the found name of the whisky is not in the input from the user, therefore the ToolArgumentHallucinationEvaluator fails with the message:

The user did not provide or reference any specific URL; choosing the specific 'togouchi-beer-cask' page is not derivable from the user's input and thus is not grounded.

I understand this, but is it possible to add the output from the first tool as input or context to the second tool for this evaluator? Or should I create two evaluators?

3 replies

fkapsahili Mar 18, 2026
Maintainer

@jettro That's a great question and thanks for trying this out!

I think that's a genuine limitation of the current ToolArgumentHallucinationEvaluator. The evaluator only considers the original user input as valid grounding context. In a multi-step agent flow where a tool's arguments come from a previous tool call's results, it flags those arguments as hallucinated because they can't be derived from the user's input. That's technically correct from the evaluator's perspective.

What I'd recommend here is to only evaluate tool calls whose arguments come directly from the user input. In your case that's the whisky search, not the detail lookup. You could then build an EvalTestCase with just those first-step tool calls:

var firstStepCalls = trace.toolCalls().stream()
    .filter(tc -> tc.name().equals("search_whisky"))
    .toList();

var testCase = EvalTestCase.builder()
    .input("Tell me about Togouchi Beer Cask whisky")
    .actualOutput("toolCalls", firstStepCalls)
    .build();

var result = ToolArgumentHallucinationEvaluator.builder()
    .judge(judge)
    .build()
    .evaluate(testCase);

And to validate the overall flow, TaskCompletionEvaluator might be a better fit in general. It sees the full conversation including tool results, so it should be able to assess whether the agent completed the task end-to-end or not.

So I think you don't need to create two evaluators! 😄

And in general: This is an actual gap that should be addressed. The evaluator should be able to fully understand multi-step grounding, where tool results from step N are valid context for step N+1's arguments :).
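To spell out what multi-step grounding means, here is a deliberately naive sketch. It uses plain substring matching, whereas the actual evaluator relies on a judge LLM, so treat it only as an illustration of the rule "results from step N are valid context for step N+1". `Step` and `GroundingSketch` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical trace step; result may be null when the tool output was not captured.
record Step(String tool, Map<String, String> arguments, String result) {}

final class GroundingSketch {
    private GroundingSketch() {}

    /** Returns "tool.argument" labels whose values appear in neither the user input nor earlier results. */
    static List<String> ungroundedArguments(String userInput, List<Step> steps) {
        List<String> ungrounded = new ArrayList<>();
        StringBuilder context = new StringBuilder(userInput.toLowerCase(Locale.ROOT));
        for (Step step : steps) {
            for (var arg : step.arguments().entrySet()) {
                String value = arg.getValue().toLowerCase(Locale.ROOT);
                if (context.indexOf(value) < 0) {
                    ungrounded.add(step.tool() + "." + arg.getKey());
                }
            }
            // Results of step N become valid grounding context for step N+1.
            if (step.result() != null) {
                context.append('\n').append(step.result().toLowerCase(Locale.ROOT));
            }
        }
        return ungrounded;
    }
}
```

With only the user input as context, an identifier that first shows up in a tool result would be reported as ungrounded; once results are appended to the context, the same identifier passes.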

Let me know if that helps!

fkapsahili Mar 23, 2026
Maintainer

@jettro This should now be fixed in release 0.14.2 through #58. Tool call results are now included as grounding context in the judge prompt, so chained arguments should be recognized correctly:

ToolCall.builder()
    .name("search_products")
    .argument("query", "lightweight running shoe")
    .result("[{\"id\": \"PRD-4821\", \"name\": \"UltraLight Runner\"}]")
    .build(),
ToolCall.builder()
    .name("get_product_details")
    .argument("product_id", "PRD-4821") // this is from the tool result above
    .build()

Please let me know if it works for you!

jettro Mar 23, 2026

Works like a charm, thanks.
