Replies: 2 comments 11 replies
-
Hi @clojj, thanks for opening the discussion! I also think this would be very valuable, and it will become highly relevant as agentic applications mature 🙂. This is already on the roadmap (you can also see "agent tool use" listed under planned evaluators in the README) 😄.

The core idea is that users capture the tool calls made by their agent under test (tool names, inputs, and optionally outputs) and provide them as part of the EvalTestCase, for example via the metadata map or a new dedicated field. Dokimos would then offer evaluators that let you assert on those tool calls: verifying that specific tools were invoked, that the tools were called in the correct sequence, or that tool inputs match expected values.

I think there are a few interesting evaluation dimensions here, for example tool call validity (was the tool called in the right way?) and tool correctness (did the agent select the right tool for the given task?). These could be separate evaluators, each focused on one aspect of tool use quality. I'll post an update here once I have started working on this. If you have specific patterns or assertions in mind that would be useful for your use case, feel free to share them here!
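The three assertions described above (tools invoked, call sequence, input values) could be sketched roughly as follows. This is only an illustration under assumptions: `ToolCall` and `ToolCallAssertions` are hypothetical names invented for this example and are not part of the Dokimos API, and the final evaluator design may look quite different.

```java
import java.util.List;
import java.util.Map;

// Hypothetical type for illustration only; not part of the Dokimos API.
// One captured tool call: the tool's name and the inputs the agent passed.
record ToolCall(String name, Map<String, Object> input) {}

// Hypothetical helper showing the three kinds of checks discussed above.
final class ToolCallAssertions {
    private ToolCallAssertions() {}

    /** True if every expected tool name appears at least once among the captured calls. */
    static boolean invokedAll(List<ToolCall> calls, List<String> expectedTools) {
        List<String> names = calls.stream().map(ToolCall::name).toList();
        return names.containsAll(expectedTools);
    }

    /** True if the captured tool names equal the expected sequence exactly, in order. */
    static boolean followsSequence(List<ToolCall> calls, List<String> expectedOrder) {
        return calls.stream().map(ToolCall::name).toList().equals(expectedOrder);
    }

    /** True if some call to the given tool contains all expected input entries (extra inputs are allowed). */
    static boolean inputMatches(List<ToolCall> calls, String tool, Map<String, Object> expected) {
        return calls.stream()
                .filter(call -> call.name().equals(tool))
                .anyMatch(call -> call.input().entrySet().containsAll(expected.entrySet()));
    }
}
```

In a test, the captured list would come from whatever tool-call recording the app under test provides. Keeping the three checks separate mirrors the idea of separate evaluators that each cover one aspect of tool use.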
-
I am trying this approach as well. I have an agent with two tools: one searches for a whisky by name, and the other uses the result of the first tool to get the URL of a detail page and request more information. The issue is with the test case. The URL and the name of the whisky that was found are not in the user's input, so the ToolArgumentHallucinationEvaluator fails with the message: "The user did not provide or reference any specific URL; choosing the specific 'togouchi-beer-cask' page is not derivable from the user's input and thus is not grounded." I understand this, but is it possible to add the output of the first tool as input or context for the second tool in this evaluator? Or should I create two evaluators?
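To make the question concrete, here is a minimal sketch of what "grounded in the user input or in earlier tool outputs" could mean for a chain of calls. It is a plain substring check written for illustration, not how ToolArgumentHallucinationEvaluator works, and whether that evaluator accepts extra context is not stated in this thread. `RecordedCall` and `ArgumentGrounding` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical type for illustration only: a captured call including the tool's output.
record RecordedCall(String name, Map<String, String> input, String output) {}

// Hypothetical helper; a naive substring check, not an LLM-based judgement.
final class ArgumentGrounding {
    private ArgumentGrounding() {}

    /**
     * Returns "tool.argument" keys whose values appear neither in the user input
     * nor in the output of any earlier tool call (case-insensitive).
     */
    static List<String> ungroundedArguments(String userInput, List<RecordedCall> calls) {
        List<String> ungrounded = new ArrayList<>();
        StringBuilder context = new StringBuilder(userInput.toLowerCase(Locale.ROOT));
        for (RecordedCall call : calls) {
            for (Map.Entry<String, String> arg : call.input().entrySet()) {
                String value = arg.getValue().toLowerCase(Locale.ROOT);
                if (!context.toString().contains(value)) {
                    ungrounded.add(call.name() + "." + arg.getKey());
                }
            }
            // Later calls may legitimately reuse anything this tool returned.
            context.append('\n').append(call.output().toLowerCase(Locale.ROOT));
        }
        return ungrounded;
    }
}
```

With the search tool's output in the accumulated context, a detail-page URL returned by that tool counts as grounded for the second call. Without it, the same URL is flagged, which matches the failure described above.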
-
Hi,
we think that for tool-calling agents it would be good to evaluate and assert on tool calls (names and possibly inputs).
This of course depends on tool-call capturing being available in the app under test, and on some interface for passing the captured calls to Dokimos.