You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
ToolCallSuccess: move runtime-status short-circuit from prompt into Python
Failed/incomplete tool_call or tool_result blocks now return a deterministic fail result without invoking the LLM judge; the prompty rubric is consulted only on the success path. Drops [STATUS] suffix from the formatted LLM input (back-compat with pre-pass-through wire format). Adds _collect_failed_tool_calls helper and _return_short_circuit_failure_result method; removes _format_status_suffix; rewrites tests.
Copy file name to clipboardExpand all lines: sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/_tool_call_success.py
Copy file name to clipboardExpand all lines: sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/tool_call_success.prompty
-57Lines changed: 0 additions & 57 deletions
Original file line number
Diff line number
Diff line change
@@ -53,15 +53,13 @@ B. Examine tool result and definition for the tool being called to check whether
53
53
1. A tool result is **failed** if **any** of the following ERROR-CASES applies to it:
54
54
ERROR-CASES:
55
55
===========
56
-
- The tool call or tool result line is annotated with **`[STATUS] failed`** or **`[STATUS] incomplete`**. These annotations indicate the tool call did not produce a usable result -- either because the runtime explicitly marked the call `failed` (an exception in the tool, the API surface returned an error response) or because the call was interrupted before completion (e.g. host timeout, parent-response cancellation surfaced as `incomplete`). They are strong, authoritative failure signals and override any contradictory appearance of the result payload.
57
56
- The tool call resulted in an error or exception
58
57
- The tool call failed to run or failed to return
59
58
- The tool call returned a result that indicates an error or failure
60
59
- The tool call returned an object or JSON string that has one or more of its fields indicating an error
61
60
- The tool timed-out or returned a result that indicate a time-out
62
61
- The tool result does not make sense, from technical perspective, not business perspective, given the definition of that tool, if the definition is present
63
62
2. If none of the error cases apply to the tool result , it is considered **succeeded** even if the tool result itself indicates a business mistake
64
-
3. The `[STATUS]` annotation is **optional**. When it is absent on a tool call, judge that call by the payload-based rules above (back-compat with runtimes that do not emit a status field). When it is present and indicates success (e.g. `[STATUS] completed`), it does not by itself make a call succeed -- still apply the payload-based rules, because a runtime can report `completed` while the tool itself returned an error payload.
65
63
C. If one or more tool result are **failed** , then you the **evaluation process** has **failed**, otherwise , the **evaluation process** has **succeeded**
66
64
D. You are required to return your **output** in the following format:
67
65
{
@@ -337,61 +335,6 @@ EXPECTED OUTPUT
337
335
}
338
336
339
337
340
-
### Example - Failed (status annotation overrides bland payload)
"reason": "send_email is annotated with [STATUS] failed on both the call and the result, which is an authoritative failure signal from the runtime even though the result body {} is otherwise inconclusive",
349
-
"properties": {
350
-
"failed_tools": "send_email"
351
-
},
352
-
"score": 0,
353
-
"status": "completed"
354
-
}
355
-
356
-
357
-
### Example - Failed (status completed but payload still indicates an error)
[TOOL_RESULT] {"UserName":"", "UserEmail":"", "Message":"failed to get current user information"} [STATUS] completed
362
-
363
-
EXPECTED OUTPUT
364
-
{
365
-
"reason": "The runtime reported [STATUS] completed but the result payload still indicates failure with empty fields and an explicit error message. Payload-based rules still apply when [STATUS] is completed -- this call is failed",
366
-
"properties": {
367
-
"failed_tools": "get_current_user_info"
368
-
},
369
-
"score": 0,
370
-
"status": "completed"
371
-
}
372
-
373
-
374
-
### Example - Failed (parallel calls in one turn, one annotated failed)
"reason": "send_email is annotated with [STATUS] failed; the other two parallel calls succeeded but a single failed call is sufficient to fail the overall evaluation",
387
-
"properties": {
388
-
"failed_tools": "send_email"
389
-
},
390
-
"score": 0,
391
-
"status": "completed"
392
-
}
393
-
394
-
395
338
396
339
Now given the **INPUT** you received generate the output
0 commit comments