Align task_navigation_efficiency override with SDK 1.16.7 input aliases #5022

Draft
Copilot wants to merge 2 commits into main from copilot/port-evaluator-logic-improvements

Copilot AI commented May 11, 2026

This PR complements the parallel version-bump/cleanup work by porting the remaining evaluator behavior gap found in kept builtin overrides after comparing them against azure-ai-evaluation 1.16.7. The only concrete SDK→assets delta in scope was task_navigation_efficiency, whose assets override still required actions / expected_actions instead of also accepting the SDK's response / ground_truth.

  • What changed

    • Updated assets/evaluators/builtin/task_navigation_efficiency/evaluator/_task_navigation_efficiency.py to accept SDK-compatible aliases:
      • response → actions
      • ground_truth → expected_actions
    • Normalization happens before validation and evaluation, so both naming conventions follow the same code path.
    • Existing assets-specific inputs remain supported; this is additive compatibility, not a contract break.
  • Behavioral coverage

    • Extended assets/evaluators/tests/test_evaluators_behavior/test_task_navigation_efficiency_evaluator_behavior.py to cover:
      • direct response / ground_truth inputs
      • JSON-stringified response / ground_truth inputs
  • Scope

    • No azure-ai-evaluation version pin changes.
    • No evaluator removals.
    • No changes to output schema or scoring behavior.
evaluator = TaskNavigationEfficiencyEvaluator()

# Existing assets form
evaluator(
    actions=[...],
    expected_actions=["search", "summarize"],
)

# SDK-compatible form now accepted too
evaluator(
    response=[...],
    ground_truth=["search", "summarize"],
)
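
The alias handling described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the `normalize_inputs` helper, the `_ALIASES` table, the rule that an explicitly passed assets-style name wins over its alias, and the choice to decode JSON strings during normalization are all assumptions made for illustration.

```python
import json
from typing import Any, Dict

# Hypothetical alias table: SDK-style name -> assets-style name.
_ALIASES = {"response": "actions", "ground_truth": "expected_actions"}


def normalize_inputs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Map SDK-style aliases onto the assets names and decode JSON strings."""
    normalized = dict(kwargs)
    for alias, canonical in _ALIASES.items():
        if alias in normalized:
            value = normalized.pop(alias)
            # Assumption: an explicitly passed assets-style name wins over its alias.
            normalized.setdefault(canonical, value)
    for key in _ALIASES.values():
        value = normalized.get(key)
        if isinstance(value, str):
            # Assumption: a string value is a JSON-encoded list.
            normalized[key] = json.loads(value)
    return normalized


print(normalize_inputs({"response": '["search"]', "ground_truth": ["search", "summarize"]}))
```

Running normalization once, before validation, means the validation and scoring code only ever see `actions` and `expected_actions`, which is what lets both naming conventions share one code path.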
Original prompt

Context

Azure/azure-sdk-for-python recently published azure-ai-evaluation==1.16.7 (2026-05-11). Between the version this repo currently consumes (1.11.1 / ~=1.0.1) and the new 1.16.7, the SDK gained many evaluator-logic fixes and features (see CHANGELOG entries for 1.12.0, 1.13.0–1.13.7, 1.14.0, 1.15.0–1.15.2, 1.16.0–1.16.7 at https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md).

A separate PR (#TBD by the user, see the version-bump PR opened in parallel) handles the version bump and removal of redundant asset overrides. This PR is the complement: for any evaluator override that is still kept in assets/evaluators/builtin/ after the cleanup (i.e. cases where assets has its own implementation), this PR ports the SDK 1.12.0→1.16.7 evaluator-logic improvements that have NOT been reflected in the assets override yet.

Goal

Open ONE PR that:

  1. For each evaluator override still present under assets/evaluators/builtin/<name>/evaluator/:
    • Compare its implementation against the SDK 1.16.7 counterpart at Azure/azure-sdk-for-python@main:sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_<name>/.
    • Identify all evaluator-logic changes that exist in the SDK but are missing from the assets override (e.g. new is_reasoning_model parameter on CoherenceEvaluator/FluencyEvaluator/SimilarityEvaluator/GroundednessEvaluator/RetrievalEvaluator/RelevanceEvaluator from 1.12.0; standardized output schema from 1.13.0; agent-evaluator string-input support from 1.16.1; TaskAdherence + Relevance variance/perf improvements from 1.10.0; F1 score fix from 1.13.7; deflection_rate inverse-metric fix from 1.16.7; etc.).
    • Apply those improvements to the asset override, preserving any asset-only logic that does not yet exist in the SDK (that asset-only logic is being upstreamed by a parallel PR in Azure/azure-sdk-for-python).
  2. Update tests under assets/evaluators/builtin/tests/ and assets/evaluators/tests/ to cover the ported behavior.
  3. Do not change azure-ai-evaluation version pins — that is the responsibility of the parallel version-bump PR.

How to enumerate gaps

For each assets/evaluators/builtin/<name>/evaluator/_<name>.py, fetch the equivalent SDK file and produce a structured gap list with three buckets:

  • SDK-has, assets-missing → port these into this PR.
  • assets-has, SDK-missing → leave alone (handled by the SDK PR).
  • equivalent → no-op.
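
One way to produce the three buckets is plain set arithmetic over hand-curated feature labels. The sketch below assumes each implementation has already been summarized as a set of short labels; `bucket_gaps` and the labels in the usage line are hypothetical and not part of either codebase.

```python
from typing import Dict, Set


def bucket_gaps(sdk_features: Set[str], asset_features: Set[str]) -> Dict[str, Set[str]]:
    """Split feature labels into the three gap-list buckets."""
    return {
        # Port these into this PR.
        "sdk_has_assets_missing": sdk_features - asset_features,
        # Leave alone (handled by the SDK PR).
        "assets_has_sdk_missing": asset_features - sdk_features,
        # No-op.
        "equivalent": sdk_features & asset_features,
    }


# Hypothetical labels for a single evaluator.
gaps = bucket_gaps(
    sdk_features={"is_reasoning_model", "string_inputs"},
    asset_features={"string_inputs", "custom_threshold"},
)
print(gaps)
```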

Concrete evaluators known to be overridden in assets and likely needing porting work:

  • coherence, customer_satisfaction, task_completion, task_adherence, relevance, fluency, groundedness, intent_resolution, retrieval, similarity, response_completeness, tool_call_accuracy, tool_call_success, tool_input_accuracy, tool_output_utilization, tool_selection, task_navigation_efficiency, ungrounded_attributes, code_vulnerability, protected_material, indirect_attack, deflection_rate, hate_unfairness, self_harm, sexual, violence, document_retrieval, f1_score, bleu_score, gleu_score, meteor_score, rouge_score, prohibited_actions, sensitive_data_leakage, quality_grader, python_grader, score_model, label_model, string_check, regex_match, bbeh, ifeval, text_similarity.

Focus areas (from CHANGELOG diff 1.12.0 → 1.16.7) most likely to require porting into asset overrides:

  • 1.13.0 standardized output schema: ensure all asset evaluators emit {name} (score), gpt_{name} (alias), {name}_result (pass/fail), {name}_reason, {name}_threshold, {name}_prompt_tokens, {name}_completion_tokens, {name}_total_tokens, {name}_finish_reason, {name}_model, {name}_sample_input, {name}_sample_output. Many asset overrides predate this and emit only gpt_<name> / <name>_reason.
  • 1.12.0: is_reasoning_model parameter on Coherence/Fluency/Similarity/Groundedness/Retrieval/Relevance.
  • 1.16.1: agentic evaluators (Groundedness, ToolCallAccuracy, ToolCallSuccess, ToolInputAccuracy, ToolOutputUtilization, ToolSelection) accept plain string inputs directly.
  • 1.13.5: TaskAdherence — treat tool_definitions as optional.
  • 1.13.7: F1 score fix.
  • 1.16.7: _TaskNavigationEfficiencyEvaluator accepts JSON-stringified response/ground_truth; deflection_rate inverse-metric pass/fail fix; task_adherence red-team scoring scenario=redteam.
  • 1.14.0: CodeVulnerability + UngroundedAttributes binary true/false scoring; Groundedness query path honors is_reasoning_model + credential.
  • 1.11.0: TokenCredentials on LLM-based evaluators; Groundedness agent tool-call support.
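
For the 1.13.0 output-schema item, a small checker can flag overrides that still emit only the legacy keys. This is a sketch built from the key list above; `expected_keys` and `missing_keys` are hypothetical helpers, not SDK functions, and the sample output dict is made up.

```python
from typing import Any, Dict, List

# Suffixes taken from the 1.13.0 schema bullet above.
_SUFFIXES = [
    "result", "reason", "threshold",
    "prompt_tokens", "completion_tokens", "total_tokens",
    "finish_reason", "model", "sample_input", "sample_output",
]


def expected_keys(name: str) -> List[str]:
    """Return the score key, its gpt_ alias, and every suffixed key."""
    return [name, f"gpt_{name}"] + [f"{name}_{suffix}" for suffix in _SUFFIXES]


def missing_keys(name: str, output: Dict[str, Any]) -> List[str]:
    """List the schema keys that an evaluator output does not contain."""
    return [key for key in expected_keys(name) if key not in output]


# A made-up legacy-style output that predates the standardized schema.
legacy_output = {"gpt_coherence": 4, "coherence_reason": "Reads clearly."}
print(missing_keys("coherence", legacy_output))
```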

Test strategy

  • Update or add tests under assets/evaluators/builtin/tests/ for each ported change. Assertions should match the SDK 1.16.7 behavior (which is now the source of truth).
  • Run pytest assets/evaluators/builtin/tests and `pytest assets/evaluators/...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Port evaluator logic improvements from SDK 1.12.0 to 1.16.7" to "Align task_navigation_efficiency override with SDK 1.16.7 input aliases" on May 11, 2026
Copilot AI requested a review from m7md7sien May 11, 2026 18:31
@github-actions

This pull request has been marked as stale because it has been inactive for 14 days.

@github-actions github-actions Bot added the Stale label May 26, 2026