This temporary guide is for quickly testing ATIF evaluation flows in the simple_web_query_eval example.
ATIF evaluation uses canonical trajectory samples (workflow_output_atif.json) so evaluators can score model outputs using
both final responses and structured agent-step context in a consistent format.
- ATIF built-in evaluators (RAGAS + trajectory lane)
- ATIF custom evaluator (atif_cosine_similarity)
From the repo root:
uv pip install -e examples/evaluation_and_profiling/simple_web_query_eval
export NVIDIA_API_KEY=<YOUR_API_KEY>
simple_web_query is pulled in as a dependency of simple_web_query_eval.
Run:
nat eval --config_file examples/evaluation_and_profiling/simple_web_query_eval/configs/eval_config_atif.yml
Note: Other ATIF config files are also available for different models (for example, eval_config_llama31_atif.yml and eval_config_llama33_atif.yml).
Expected output directory:
./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif/
Expected key files:
- workflow_output.json
- workflow_output_atif.json
- accuracy_output.json
- groundedness_output.json
- relevance_output.json
- trajectory_accuracy_output.json
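To confirm that a run produced the expected key files listed above, a small check like the following may help. It is only a sketch: the helper name `missing_files` is hypothetical and not part of the toolkit, and it checks for file presence only, not file contents.

```python
from pathlib import Path

# File names copied from the "Expected key files" list above.
EXPECTED_FILES = [
    "workflow_output.json",
    "workflow_output_atif.json",
    "accuracy_output.json",
    "groundedness_output.json",
    "relevance_output.json",
    "trajectory_accuracy_output.json",
]


def missing_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in output_dir."""
    base = Path(output_dir)
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    out = "./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif/"
    missing = missing_files(out)
    print("all expected files present" if not missing else f"missing: {missing}")
```

The same helper can be pointed at another output directory by passing a different `expected` list.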
Run:
nat eval --config_file examples/evaluation_and_profiling/simple_web_query_eval/configs/eval_config_atif_custom_evaluator.yml
Expected output directory:
./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif_custom_evaluator/
Expected key files:
- workflow_output.json
- workflow_output_atif.json
- atif_cosine_similarity_eval_output.json
Notes:
- The custom evaluator is ATIF-only and registered from nat_simple_web_query_eval.
- It scores using token cosine similarity and includes trajectory metadata (trajectory_tool_call_count) in reasoning.
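For readers unfamiliar with token cosine similarity, here is a minimal sketch of the general technique. It is not the code of the atif_cosine_similarity evaluator: the tokenization (lowercased `\w+` matches) and the function name `token_cosine_similarity` are assumptions made for illustration, and the real evaluator may tokenize or normalize differently.

```python
import math
import re
from collections import Counter


def token_cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words token count vectors."""
    # Assumed tokenization: lowercase, then split into word characters.
    tokens_a = Counter(re.findall(r"\w+", text_a.lower()))
    tokens_b = Counter(re.findall(r"\w+", text_b.lower()))
    # Dot product over shared tokens (Counter returns 0 for missing keys).
    dot = sum(count * tokens_b[token] for token, count in tokens_a.items())
    norm_a = math.sqrt(sum(c * c for c in tokens_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tokens_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


print(token_cosine_similarity("LangSmith is a platform", "LangSmith is a tool"))  # 0.75
```

Three of the four tokens match in each text, so the score is 3 / (2 * 2) = 0.75. Scores range from 0.0 (no shared tokens) to 1.0 (same token counts).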
For agent workflows with tool calls, you can use Ragas multi-turn metrics such as AgentGoalAccuracyWithoutReference
and ToolCallAccuracy. These require sample_type: multi_turn and enable_atif_evaluator: true in the evaluator config:
eval:
  evaluators:
    agent_goal:
      _type: ragas
      metric: AgentGoalAccuracyWithoutReference
      llm_name: nim_rag_eval_llm
      enable_atif_evaluator: true
      sample_type: multi_turn
Multi-turn mode converts ATIF trajectory steps into a Ragas MultiTurnSample with HumanMessage,
AIMessage (with ToolCall), and ToolMessage sequences, preserving the full agent interaction history.
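The shape of that conversion can be sketched as follows. This is an illustration under loud assumptions, not the toolkit's converter: the step fields used here (`source`, `message`, `tool_calls`, `function_name`, `arguments`, `observations`) are a simplified, hypothetical schema, and plain dictionaries stand in for the Ragas message classes so the sketch runs without Ragas installed.

```python
def steps_to_messages(steps):
    """Flatten trajectory steps into an ordered list of message dicts."""
    messages = []
    for step in steps:
        if step["source"] == "user":
            # User turns map to a human message.
            messages.append({"type": "human", "content": step["message"]})
            continue
        # Agent turns map to one AI message carrying any tool calls it issued...
        tool_calls = [
            {"name": call["function_name"], "args": call["arguments"]}
            for call in step.get("tool_calls", [])
        ]
        messages.append(
            {"type": "ai", "content": step.get("message", ""), "tool_calls": tool_calls}
        )
        # ...followed by one tool message per observation returned to the agent.
        for result in step.get("observations", []):
            messages.append({"type": "tool", "content": result})
    return messages


# Hypothetical trajectory: a question, one tool call, then a final answer.
example_steps = [
    {"source": "user", "message": "What is LangSmith?"},
    {
        "source": "agent",
        "message": "",
        "tool_calls": [{"function_name": "webpage_query", "arguments": {"query": "LangSmith"}}],
        "observations": ["LangSmith is a platform for LLM applications."],
    },
    {"source": "agent", "message": "LangSmith is a platform for LLM applications."},
]
print([m["type"] for m in steps_to_messages(example_steps)])  # ['human', 'ai', 'tool', 'ai']
```

Keeping the tool message directly after the AI message that issued the call preserves the ordering that metrics such as ToolCallAccuracy rely on.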
Compare two run directories:
python packages/nvidia_nat_eval/scripts/compare_eval_runs.py \
--run_a ./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif \
--run_b ./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/atif_custom_evaluator
This is mostly useful to verify file presence and differences, since the evaluator sets differ between these two configuration files.
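A file-presence comparison of this kind can also be done by hand. The sketch below is not a description of what compare_eval_runs.py does internally; the function name `compare_file_sets` is hypothetical, and it only compares top-level file names in the two directories.

```python
from pathlib import Path


def compare_file_sets(run_a, run_b):
    """Report which top-level file names appear in one run directory, the other, or both."""
    names_a = {p.name for p in Path(run_a).iterdir() if p.is_file()}
    names_b = {p.name for p in Path(run_b).iterdir() if p.is_file()}
    return {
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
        "in_both": sorted(names_a & names_b),
    }
```

For the two runs above, workflow_output.json and workflow_output_atif.json would be expected under `in_both`, with the evaluator-specific output files split between `only_in_a` and `only_in_b`.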