ATIF Eval Temporary Testing Guide

This temporary guide is for quickly testing ATIF evaluation flows in the simple_web_query_eval example. ATIF evaluation uses canonical trajectory samples (workflow_output_atif.json) so evaluators can score model outputs using both final responses and structured agent-step context in a consistent format.

Scope

  • ATIF built-in evaluators (Ragas + trajectory lane)
  • ATIF custom evaluator (atif_cosine_similarity)

Prerequisites

From the repo root:

uv pip install -e examples/evaluation_and_profiling/simple_web_query_eval
export NVIDIA_API_KEY=<YOUR_API_KEY>

simple_web_query is pulled in as a dependency of simple_web_query_eval.

1) Test ATIF built-in evaluators

Run:

nat eval --config_file examples/evaluation_and_profiling/simple_web_query_eval/configs/eval_config_atif.yml

Note

Other ATIF config files are also available for different models (for example eval_config_llama31_atif.yml and eval_config_llama33_atif.yml).

Expected output directory:

./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif/

Expected key files:

  • workflow_output.json
  • workflow_output_atif.json
  • accuracy_output.json
  • groundedness_output.json
  • relevance_output.json
  • trajectory_accuracy_output.json
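To confirm a run produced the files listed above, a small check like the following can help. This is a minimal sketch, not part of the toolkit: the helper name `missing_outputs` is hypothetical, and the directory is the expected output path stated above.

```python
from pathlib import Path

# File names copied from the "Expected key files" list above.
EXPECTED_FILES = [
    "workflow_output.json",
    "workflow_output_atif.json",
    "accuracy_output.json",
    "groundedness_output.json",
    "relevance_output.json",
    "trajectory_accuracy_output.json",
]


def missing_outputs(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    out_dir = "./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif"
    missing = missing_outputs(out_dir)
    print("all expected files present" if not missing else f"missing: {missing}")
```

The same helper can be pointed at another output directory by passing a different `expected` list.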

2) Test ATIF custom evaluator only

Run:

nat eval --config_file examples/evaluation_and_profiling/simple_web_query_eval/configs/eval_config_atif_custom_evaluator.yml

Expected output directory:

./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif_custom_evaluator/

Expected key files:

  • workflow_output.json
  • workflow_output_atif.json
  • atif_cosine_similarity_eval_output.json

Notes:

  • The custom evaluator is ATIF-only and registered from nat_simple_web_query_eval.
  • It scores with token cosine similarity and reports trajectory metadata (trajectory_tool_call_count) in its reasoning output.
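For intuition about the scoring, token cosine similarity can be sketched as below. This is an illustrative version only: the tokenizer (lowercased word tokens) and the function names are assumptions, and the evaluator registered in nat_simple_web_query_eval may tokenize or normalize differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens; the real evaluator may differ.
    return re.findall(r"\w+", text.lower())


def token_cosine_similarity(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the tokens the two strings share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    # An empty string has a zero vector; treat similarity as 0.0 in that case.
    return dot / norm if norm else 0.0
```

Under this definition, two strings with the same tokens score close to 1.0 regardless of word order, and two strings with no tokens in common score 0.0.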

3) Test ATIF multi-turn Ragas evaluators

For agent workflows with tool calls, you can use Ragas multi-turn metrics such as AgentGoalAccuracyWithoutReference and ToolCallAccuracy. These require sample_type: multi_turn and enable_atif_evaluator: true in the evaluator config:

eval:
  evaluators:
    agent_goal:
      _type: ragas
      metric: AgentGoalAccuracyWithoutReference
      llm_name: nim_rag_eval_llm
      enable_atif_evaluator: true
      sample_type: multi_turn

Multi-turn mode converts ATIF trajectory steps into a Ragas MultiTurnSample with HumanMessage, AIMessage (with ToolCall), and ToolMessage sequences, preserving the full agent interaction history.
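The conversion described above can be pictured with the following sketch. It is deliberately simplified and makes several assumptions: the step fields (`source`, `message`, `tool_calls`, `tool_results`) are a hypothetical stand-in for the real ATIF schema, the `search` tool name is invented for illustration, and plain dictionaries stand in for the Ragas message classes so the example runs without extra dependencies.

```python
def steps_to_messages(steps):
    """Flatten simplified trajectory steps into an ordered message list."""
    messages = []
    for step in steps:
        if step["source"] == "user":
            # User turns become human messages.
            messages.append({"role": "human", "content": step["message"]})
            continue
        # Agent turns become one AI message carrying any tool calls ...
        calls = [
            {"name": call["name"], "args": call["args"]}
            for call in step.get("tool_calls", [])
        ]
        messages.append(
            {"role": "ai", "content": step.get("message", ""), "tool_calls": calls}
        )
        # ... followed by one tool message per tool result, in order.
        for result in step.get("tool_results", []):
            messages.append({"role": "tool", "content": result})
    return messages


# Hypothetical three-step trajectory: question, tool use, final answer.
example_steps = [
    {"source": "user", "message": "<user question>"},
    {
        "source": "agent",
        "message": "",
        "tool_calls": [{"name": "search", "args": {"query": "<query>"}}],
        "tool_results": ["<tool output>"],
    },
    {"source": "agent", "message": "<final answer>"},
]

for message in steps_to_messages(example_steps):
    print(message["role"], "|", message["content"])
```

The point of the ordering is that each tool result follows the AI message that requested it, which is what lets metrics such as ToolCallAccuracy inspect the calls in context.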

4) Optional quick compare

Compare two run directories:

python packages/nvidia_nat_eval/scripts/compare_eval_runs.py \
  --run_a ./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif \
  --run_b ./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif_custom_evaluator

Because the two configuration files run different evaluator sets, this comparison is mostly useful for checking which output files each run produced and where they differ.