ATIF Eval Temporary Testing Guide

This temporary guide covers quick testing of ATIF evaluation flows in the simple_web_query_eval example. ATIF evaluation reads canonical trajectory samples (workflow_output_atif.json), so evaluators can score model outputs against both the final response and the structured agent-step context, in one consistent format.

Scope

ATIF built-in evaluators (Ragas + trajectory lane)
ATIF custom evaluator (atif_cosine_similarity)

Prerequisites

From the repo root:

uv pip install -e examples/evaluation_and_profiling/simple_web_query_eval
export NVIDIA_API_KEY=<YOUR_API_KEY>

simple_web_query is pulled in as a dependency of simple_web_query_eval.

1) Test ATIF built-in evaluators

Run:

nat eval --config_file examples/evaluation_and_profiling/simple_web_query_eval/configs/eval_config_atif.yml

Note

Other ATIF config files are also available for different models (for example eval_config_llama31_atif.yml and eval_config_llama33_atif.yml).

Expected output directory:

./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif/

Expected key files:

workflow_output.json
workflow_output_atif.json
accuracy_output.json
groundedness_output.json
relevance_output.json
trajectory_accuracy_output.json
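
To confirm the run produced every file listed above, a small check like the following may help. It is a minimal sketch, not part of the toolkit: the helper name missing_outputs is hypothetical, and the file names and directory are copied from this guide.

```python
from pathlib import Path

# File names copied from the "Expected key files" list in this guide.
EXPECTED_FILES = [
    "workflow_output.json",
    "workflow_output_atif.json",
    "accuracy_output.json",
    "groundedness_output.json",
    "relevance_output.json",
    "trajectory_accuracy_output.json",
]


def missing_outputs(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    out_dir = "./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif"
    missing = missing_outputs(out_dir)
    print("all expected files present" if not missing else f"missing: {missing}")
```

The same helper can be pointed at another output directory by passing a different expected list.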

2) Test ATIF custom evaluator only

Run:

nat eval --config_file examples/evaluation_and_profiling/simple_web_query_eval/configs/eval_config_atif_custom_evaluator.yml

Expected output directory:

./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif_custom_evaluator/

Expected key files:

workflow_output.json
workflow_output_atif.json
atif_cosine_similarity_eval_output.json

Notes:

The custom evaluator is ATIF-only and registered from nat_simple_web_query_eval.
It scores using token cosine similarity and includes trajectory metadata (trajectory_tool_call_count) in reasoning.
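
For intuition, token cosine similarity can be sketched as below. This is an illustration under stated assumptions, not the code of atif_cosine_similarity: the function name and the lowercase word tokenization are hypothetical, and the real evaluator may tokenize and weight terms differently.

```python
import math
import re
from collections import Counter


def token_cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words token counts of two strings."""
    # Assumption: lowercase word tokens; the actual evaluator may differ.
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    # Dot product over tokens in a; Counter returns 0 for tokens missing from b.
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # An empty string on either side gives a zero norm; report 0.0 instead of dividing.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Identical token multisets score 1.0 (up to floating-point rounding), and texts with no tokens in common score 0.0.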

3) Test ATIF multi-turn Ragas evaluators

For agent workflows with tool calls, you can use Ragas multi-turn metrics such as AgentGoalAccuracyWithoutReference and ToolCallAccuracy. These require sample_type: multi_turn and enable_atif_evaluator: true in the evaluator config:

eval:
  evaluators:
    agent_goal:
      _type: ragas
      metric: AgentGoalAccuracyWithoutReference
      llm_name: nim_rag_eval_llm
      enable_atif_evaluator: true
      sample_type: multi_turn

Multi-turn mode converts ATIF trajectory steps into a Ragas MultiTurnSample with HumanMessage, AIMessage (with ToolCall), and ToolMessage sequences, preserving the full agent interaction history.
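
The shape of that conversion can be sketched with plain dictionaries. This is a hedged illustration only: the step field names used here (source, message, tool_calls, function_name, arguments, observation, results, content) are assumptions about the ATIF step layout, the function name is hypothetical, and the "human" / "ai" / "tool" dictionaries stand in for the Ragas HumanMessage, AIMessage (with ToolCall), and ToolMessage objects rather than constructing them.

```python
def atif_steps_to_messages(steps):
    """Flatten ATIF-like steps into an ordered human / ai / tool message list."""
    messages = []
    for step in steps:
        source = step.get("source")
        if source == "user":
            messages.append({"type": "human", "content": step.get("message", "")})
        elif source == "agent":
            # Assumed layout: each tool call carries a function name and arguments.
            calls = [
                {"name": call["function_name"], "args": call.get("arguments", {})}
                for call in (step.get("tool_calls") or [])
            ]
            messages.append(
                {"type": "ai", "content": step.get("message", ""), "tool_calls": calls}
            )
            # Tool results follow the AI message that requested them.
            for result in (step.get("observation") or {}).get("results", []):
                messages.append({"type": "tool", "content": result.get("content", "")})
        # Steps from any other source are skipped in this sketch.
    return messages
```

Keeping each tool result directly after the AI message that issued the call preserves the ordering that multi-turn metrics rely on.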

4) Optional quick compare

Compare two run directories:

python packages/nvidia_nat_eval/scripts/compare_eval_runs.py \
  --run_a ./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif \
  --run_b ./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif_custom_evaluator

This mostly verifies which files are present and how they differ, since the two configuration files enable different evaluator sets.
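
If only file presence matters, a stand-alone check along these lines may be enough. It is a minimal sketch and does not reproduce what compare_eval_runs.py does; the function name compare_file_sets is hypothetical.

```python
from pathlib import Path


def compare_file_sets(run_a, run_b):
    """Report which file names appear in only one run directory or in both."""
    names_a = {p.name for p in Path(run_a).iterdir() if p.is_file()}
    names_b = {p.name for p in Path(run_b).iterdir() if p.is_file()}
    return {
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
        "in_both": sorted(names_a & names_b),
    }


if __name__ == "__main__":
    base = "./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval"
    for key, names in compare_file_sets(f"{base}/atif", f"{base}/atif_custom_evaluator").items():
        print(f"{key}: {names}")
```

Given the expected key files listed in this guide, workflow_output.json and workflow_output_atif.json should appear under in_both, with the evaluator output files split between the other two groups.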
