feat(evaluator): Evidence handles, ATIF traces and trace normalisation#432
feat(evaluator): Evidence handles, ATIF traces and trace normalisation#432arpitsardhana wants to merge 1 commit into
Conversation
|
55ce862 to
15882d4
Compare
| """Return the normalized trace, reading and parsing on first access.""" | ||
| if self._trace is None: | ||
| payload = await asyncio.to_thread(self._load_payload) | ||
| self._trace = normalize_trace(payload, source_format=self._descriptor.format) |
There was a problem hiding this comment.
fabric?
Trace handle need to understand source format
|
|
||
|
|
||
| # (kind_attr, tool_kinds, tool_name_attr, prompt_attr, completion_attr) | ||
| _OTEL_KEYS = ( |
There was a problem hiding this comment.
Add codex and Claude format
See jsonl based multi -turn conversations
948dc8a to
9c8325f
Compare
|
Caution Review failedAn error occurred during the review process. Please try again later. 📝 WalkthroughWalkthroughTrace evidence is normalized to ATIF during trial creation and descriptor generation. Evidence handles expose normalized traces and logs. Example metrics now use evidence constants and add a retry-loop metric. Tests cover normalization, log access, timeout cleanup, and the new metric. ChangesATIF trace evidence and retry metrics
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 OpenGrep (1.23.0)packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.49][ERROR]: unable to find a config; path packages/nemo_evaluator_sdk/tests/agent_eval/test_example_metrics.py┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.15][ERROR]: unable to find a config; path packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.13][ERROR]: unable to find a config; path
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (1)
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py (1)
161-168: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftThis strips source format for file-based traces.
standard_evidence_descriptors()can now only emitatiforjson, so OTEL/OpenInference exports passed viatrace_pathcan never hit their normalizers on this path. This needs an explicit trace-format input instead of filename heuristics.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py` around lines 161 - 168, The trace descriptor normalization is using filename heuristics in standard_evidence_descriptors() / the trace_path handling, which causes file-based OTEL/OpenInference traces to lose their original format. Update the evidence-building flow to accept an explicit trace-format input and pass that through when constructing the EvidenceDescriptor for EVIDENCE_TRACE, instead of inferring “atif” vs “json” from is_atif or the path name. Ensure normalize_trace_descriptor and the caller in trials.py preserve the source format so the correct normalizer can run.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py`:
- Around line 19-21: The retry detection in the example metrics logic is
counting global frequency instead of consecutive repeats, so update the relevant
code in example_metrics.py to measure the longest streak of identical calls in
the trace rather than total occurrences. In the call-counting path around the
metrics computation (the logic that uses EVIDENCE_TRACE), canonicalize nested
args before comparing so semantically equivalent dicts with different insertion
order map to the same payload, and keep the streak bounded to adjacent entries
only. Use the existing symbols EVIDENCE_TRACE, EVIDENCE_INITIAL_STATE, and
EVIDENCE_FINAL_STATE to locate the trace-processing code and adjust the retry
metric accordingly.
In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py`:
- Around line 490-510: normalize_trace() currently drops any list-shaped payload
unless source_format is explicitly otel/openinference, which can erase live
evaluator trajectory evidence. Update normalize_trace() in evidence.py to
preserve unknown list traces by either routing them through the appropriate
list-based normalizer or explicitly validating and rejecting unsupported list
shapes instead of returning an empty AtifTrace(). Keep the existing dict
handling for steps/events and ensure the fallback path for bare lists is handled
safely rather than silently discarding data.
- Around line 534-541: The ref-handling in `_local_filesystem_ref()` / the
`descriptor.ref is not None` branch currently raises for remote refs before the
`is_file()` no-op path can run, which conflicts with the documented behavior.
Update the logic so non-local or unresolvable refs are detected and returned
unchanged in `descriptor.model_copy` flow, and only call
`_local_filesystem_ref()` after confirming the ref is local or otherwise safe to
resolve. Keep the existing `normalize_trace` and `.atif.json` write path for
local files, but preserve the unchanged descriptor behavior for remote refs.
- Around line 671-682: The `Evidence.trace()` accessor is currently using
`self.require(name)` without verifying the descriptor kind, so a non-trace entry
can be wrapped as a `TraceHandle` and look valid. Update `trace()` to enforce
`kind="trace"` before constructing the handle, matching the stricter contract
already implied by `TraceHandle` and the existing `filesystem()`/`logs()`
accessors. Keep the cache behavior in `_trace_cache`, but ensure only true trace
descriptors are accepted and anything else fails immediately.
In `@packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py`:
- Around line 5-7: The process cleanup assertion in test_evidence.py is using a
fixed 200 ms wait, which can make the test flaky if the child is briefly a
zombie on a busy runner. Update the test around the cleanup check to poll
`os.kill(pid, 0)` until a short deadline instead of sleeping once, so the
assertion waits for reaping to complete. Use the existing process-tree
cleanup/test logic in `test_evidence` to locate the affected check and keep the
polling window small.
---
Duplicate comments:
In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py`:
- Around line 161-168: The trace descriptor normalization is using filename
heuristics in standard_evidence_descriptors() / the trace_path handling, which
causes file-based OTEL/OpenInference traces to lose their original format.
Update the evidence-building flow to accept an explicit trace-format input and
pass that through when constructing the EvidenceDescriptor for EVIDENCE_TRACE,
instead of inferring “atif” vs “json” from is_atif or the path name. Ensure
normalize_trace_descriptor and the caller in trials.py preserve the source
format so the correct normalizer can run.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 739e49cd-e370-4b40-991d-c3b8b5189fc5
⛔ Files ignored due to path filters (4)
sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/agent_eval/evaluator.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/agent_eval/trials.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/values/__init__.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/values/evidence.pyis excluded by!sdk/**
📒 Files selected for processing (7)
packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.pypackages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/evaluator.pypackages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.pypackages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/__init__.pypackages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.pypackages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.pypackages/nemo_evaluator_sdk/tests/agent_eval/test_example_metrics.py
| from collections.abc import Sequence | ||
|
|
||
| from nemo_evaluator_sdk.agent_eval.trials import EVIDENCE_FINAL_STATE, EVIDENCE_INITIAL_STATE, EVIDENCE_TRACE |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Count retry streaks, not global frequency.
Lines 118-121 count identical calls across the entire trace. That flags legitimate reuse separated by other work as a retry loop, and the current key still splits semantically identical nested args when dict insertion order differs. Track the longest consecutive run from a canonicalized call payload instead.
Suggested fix
+import json
from collections.abc import Sequence
@@
async def compute_scores(self, input: MetricInput) -> MetricResult:
max_repeats = 0
evidence = input.candidate.evidence
if evidence is not None and evidence.get(self._evidence_name) is not None:
calls = await (await evidence.trace(self._evidence_name)).tool_calls()
- counts: dict[str, int] = {}
+ previous_key: str | None = None
+ current_repeats = 0
for call in calls:
- key = f"{call.name}:{sorted((call.arguments or {}).items())}"
- counts[key] = counts.get(key, 0) + 1
- max_repeats = max(counts.values(), default=0)
+ key = json.dumps(
+ {"name": call.name, "arguments": call.arguments or {}},
+ sort_keys=True,
+ separators=(",", ":"),
+ )
+ current_repeats = current_repeats + 1 if key == previous_key else 1
+ previous_key = key
+ max_repeats = max(max_repeats, current_repeats)
return MetricResult(
outputs=[
MetricOutput(name="efficient_tool_use", value=max_repeats <= self._threshold),
MetricOutput(name="max_repeated_tool_calls", value=max_repeats),
]Also applies to: 112-125
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py`
around lines 19 - 21, The retry detection in the example metrics logic is
counting global frequency instead of consecutive repeats, so update the relevant
code in example_metrics.py to measure the longest streak of identical calls in
the trace rather than total occurrences. In the call-counting path around the
metrics computation (the logic that uses EVIDENCE_TRACE), canonicalize nested
args before comparing so semantically equivalent dicts with different insertion
order map to the same payload, and keep the streak bounded to adjacent entries
only. Use the existing symbols EVIDENCE_TRACE, EVIDENCE_INITIAL_STATE, and
EVIDENCE_FINAL_STATE to locate the trace-processing code and adjust the retry
metric accordingly.
| def normalize_trace(payload: Any, *, source_format: str | None = None) -> AtifTrace: | ||
| """Normalize a raw trace payload into an :class:`AtifTrace`. | ||
|
|
||
| ``source_format`` (from the evidence descriptor) routes to a span normalizer | ||
| when set to ``otel``/``openinference``; otherwise the payload shape decides: | ||
| a NAT trajectory (``steps``), an already-normalized trace (``events``), or a | ||
| bare span list. | ||
| """ | ||
| if isinstance(payload, AtifTrace): | ||
| return payload | ||
| fmt = (source_format or "").lower() | ||
| if fmt in {"otel", "opentelemetry"} and isinstance(payload, list): | ||
| return normalize_otel_spans(payload) | ||
| if fmt == "openinference" and isinstance(payload, list): | ||
| return normalize_openinference_spans(payload) | ||
| if isinstance(payload, dict): | ||
| if "steps" in payload: | ||
| return normalize_nat_trajectory(payload) | ||
| if "events" in payload: | ||
| return AtifTrace.model_validate(payload) | ||
| return AtifTrace() |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift
Unknown list traces are dropped here.
normalize_trace() returns AtifTrace() for any list unless source_format is otel/openinference. The live evaluator path still feeds list-shaped trajectory payloads, so this change can erase trace evidence instead of normalizing or rejecting it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py` around
lines 490 - 510, normalize_trace() currently drops any list-shaped payload
unless source_format is explicitly otel/openinference, which can erase live
evaluator trajectory evidence. Update normalize_trace() in evidence.py to
preserve unknown list traces by either routing them through the appropriate
list-based normalizer or explicitly validating and rejecting unsupported list
shapes instead of returning an empty AtifTrace(). Keep the existing dict
handling for steps/events and ensure the fallback path for bare lists is handled
safely rather than silently discarding data.
| if descriptor.ref is not None: | ||
| source = _local_filesystem_ref(descriptor.ref) | ||
| if not source.is_file(): | ||
| return descriptor | ||
| trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format) | ||
| atif_path = source.with_name(f"{source.stem}.atif.json") | ||
| atif_path.write_text(trace.model_dump_json(), encoding="utf-8") | ||
| return descriptor.model_copy(update={"format": "atif", "ref": str(atif_path), "data": None}) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Remote trace refs now throw instead of no-oping.
The docstring says unresolvable refs are returned unchanged, but _local_filesystem_ref() raises on non-local refs before the is_file() guard runs.
Proposed fix
if descriptor.ref is not None:
- source = _local_filesystem_ref(descriptor.ref)
+ try:
+ source = _local_filesystem_ref(descriptor.ref)
+ except ValueError:
+ return descriptor
if not source.is_file():
return descriptor
trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if descriptor.ref is not None: | |
| source = _local_filesystem_ref(descriptor.ref) | |
| if not source.is_file(): | |
| return descriptor | |
| trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format) | |
| atif_path = source.with_name(f"{source.stem}.atif.json") | |
| atif_path.write_text(trace.model_dump_json(), encoding="utf-8") | |
| return descriptor.model_copy(update={"format": "atif", "ref": str(atif_path), "data": None}) | |
| if descriptor.ref is not None: | |
| try: | |
| source = _local_filesystem_ref(descriptor.ref) | |
| except ValueError: | |
| return descriptor | |
| if not source.is_file(): | |
| return descriptor | |
| trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format) | |
| atif_path = source.with_name(f"{source.stem}.atif.json") | |
| atif_path.write_text(trace.model_dump_json(), encoding="utf-8") | |
| return descriptor.model_copy(update={"format": "atif", "ref": str(atif_path), "data": None}) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py` around
lines 534 - 541, The ref-handling in `_local_filesystem_ref()` / the
`descriptor.ref is not None` branch currently raises for remote refs before the
`is_file()` no-op path can run, which conflicts with the documented behavior.
Update the logic so non-local or unresolvable refs are detected and returned
unchanged in `descriptor.model_copy` flow, and only call
`_local_filesystem_ref()` after confirming the ref is local or otherwise safe to
resolve. Keep the existing `normalize_trace` and `.atif.json` write path for
local files, but preserve the unchanged descriptor behavior for remote refs.
| async def trace(self, name: str = "trace") -> TraceHandle: | ||
| """Return a cached normalized-trace handle for a named trace descriptor. | ||
|
|
||
| Async for consistency with :meth:`filesystem`/:meth:`logs` and to leave room | ||
| for ref staging/validation; the trace itself is read lazily on first access. | ||
| """ | ||
| cached = self._trace_cache.get(name) | ||
| if cached is not None: | ||
| return cached | ||
| handle = TraceHandle(self.require(name)) | ||
| self._trace_cache[name] = handle | ||
| return handle |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Enforce kind="trace" here.
Unlike filesystem() and logs(), this accessor accepts any descriptor kind. That currently turns a non-trace "trace" entry into an empty trace instead of surfacing the contract break.
Proposed fix
- handle = TraceHandle(self.require(name))
+ handle = TraceHandle(self.require(name, kind="trace"))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async def trace(self, name: str = "trace") -> TraceHandle: | |
| """Return a cached normalized-trace handle for a named trace descriptor. | |
| Async for consistency with :meth:`filesystem`/:meth:`logs` and to leave room | |
| for ref staging/validation; the trace itself is read lazily on first access. | |
| """ | |
| cached = self._trace_cache.get(name) | |
| if cached is not None: | |
| return cached | |
| handle = TraceHandle(self.require(name)) | |
| self._trace_cache[name] = handle | |
| return handle | |
| async def trace(self, name: str = "trace") -> TraceHandle: | |
| """Return a cached normalized-trace handle for a named trace descriptor. | |
| Async for consistency with :meth:`filesystem`/:meth:`logs` and to leave room | |
| for ref staging/validation; the trace itself is read lazily on first access. | |
| """ | |
| cached = self._trace_cache.get(name) | |
| if cached is not None: | |
| return cached | |
| handle = TraceHandle(self.require(name, kind="trace")) | |
| self._trace_cache[name] = handle | |
| return handle |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py` around
lines 671 - 682, The `Evidence.trace()` accessor is currently using
`self.require(name)` without verifying the descriptor kind, so a non-trace entry
can be wrapped as a `TraceHandle` and look valid. Update `trace()` to enforce
`kind="trace"` before constructing the handle, matching the stricter contract
already implied by `TraceHandle` and the existing `filesystem()`/`logs()`
accessors. Keep the cache behavior in `_trace_cache`, but ensure only true trace
descriptors are accepted and anything else fails immediately.
| import json | ||
| import os | ||
| from pathlib import Path |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Avoid the fixed 200 ms reap window.
Line 168 assumes the killed child will be fully reaped within 200 ms. On a busy runner, os.kill(pid, 0) can still see a zombie briefly and make this test flaky even when the process tree cleanup is correct. Poll to a short deadline instead.
Suggested fix
import json
import os
+import time
from pathlib import Path
@@
child_pid = int(pidfile.read_text().strip())
- await asyncio.sleep(0.2) # let the OS finish reaping the killed group
- with pytest.raises(ProcessLookupError):
- os.kill(child_pid, 0)
+ deadline = time.monotonic() + 2
+ while True:
+ try:
+ os.kill(child_pid, 0)
+ except ProcessLookupError:
+ break
+ if time.monotonic() >= deadline:
+ pytest.fail(f"child process {child_pid} still exists after timeout cleanup")
+ await asyncio.sleep(0.05)Also applies to: 167-170
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py` around lines 5
- 7, The process cleanup assertion in test_evidence.py is using a fixed 200 ms
wait, which can make the test flaky if the child is briefly a zombie on a busy
runner. Update the test around the cleanup check to poll `os.kill(pid, 0)` until
a short deadline instead of sleeping once, so the assertion waits for reaping to
complete. Use the existing process-tree cleanup/test logic in `test_evidence` to
locate the affected check and keep the polling window small.
| metadata: dict[str, Any] = Field(default_factory=dict) | ||
|
|
||
|
|
||
| def normalize_nat_trajectory(payload: dict[str, Any]) -> AtifTrace: |
There was a problem hiding this comment.
this normalisation is not accurate. need to be fixed
Add the ATIF (Agent Trace Interchange Format) model and normalizers for NAT trajectories, OpenTelemetry GenAI spans, and OpenInference spans, plus the TraceHandle and LogHandle read handles exposed via CandidateEvidence.trace() and CandidateEvidence.logs(). Includes the example-only inefficient_retry_loop metric that scores over the normalized trace evidence. Vendored mirror synced via make vendor. Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
9c8325f to
2da7a67
Compare
| initial_name: str = EVIDENCE_INITIAL_STATE, | ||
| final_name: str = EVIDENCE_FINAL_STATE, |
There was a problem hiding this comment.
Question, do you think these need to be injected here? When would the initial/final states have a different name?
| if evidence is not None and evidence.get(self._evidence_name) is not None: | ||
| calls = await (await evidence.trace(self._evidence_name)).tool_calls() |
There was a problem hiding this comment.
Why would the user choose to get different evidence types in the metric? It seems like this always relies on it being EVIDENCE_TRACE, right?
| """Return the normalized ATIF trajectory, reading and parsing on first access.""" | ||
| if self._trajectory is None: | ||
| payload = await asyncio.to_thread(self._load_payload) | ||
| self._trajectory = normalize_trace(payload, source_format=self._descriptor.format) |
There was a problem hiding this comment.
I don't think I quite understand the process in the PR. We end up normalizing the EvidenceDescriptor which converts the payload to atif, but we also call it here to normalize the trajectory. Do we need both? Or am I reading this incorrectly. I'd assume the normalizing happening here is sufficient and we don't need to pre-emptively normalize the evidence descriptor.
| trace = EvidenceDescriptor(kind="trace", format="json", data=sample["trajectory"]) | ||
| # Normalize to ATIF before the trial is persisted so the stored shape is | ||
| # source-agnostic (sources in, ATIF out); TraceHandle then reads it uniformly. | ||
| trace = normalize_trace_descriptor(EvidenceDescriptor(kind="trace", format="json", data=sample["trajectory"])) |
There was a problem hiding this comment.
I don't think this is the right approach tbh. I think persisting the raw trace and convert on metric usage is the right approach. The problem with converting before persisting is that if there's a bug in our conversion code or we don't handle the source format fully, the conversion may be lossy. At least if we convert on read, users can improve the conversion function without loss.
Summary
Adds the portable trace layer for agent-eval evidence so metrics can score off what the agent actually did (the trace and logs), independent of the producing runtime. Rebased onto
mainafter the filesystem handle /run_verifier/ taskset work landed via #447, so this PR is now scoped to traces and logs only.values/evidence.py):AtifTrace/AtifEvent/AtifEventType/AtifTokenUsage— a single normalized, persisted trace format with token accounting.normalize_tracedispatches by descriptorformat/payload shape tonormalize_nat_trajectory,normalize_otel_spans, andnormalize_openinference_spans, so NAT trajectories and OTel/OpenInference spans all map to ATIF.normalize_trace_descriptor(idempotent, preserve-modality) andnormalize_candidate_evidencerun before a trial is persisted — wired into the live inline path (evaluator._trial_from_sample) and the file path (trials.standard_evidence_descriptors). Inlinedatais normalized in place; a filerefis normalized into a sibling*.atif.json. Result: persisted trace evidence is ATIF regardless of source.CandidateEvidence.trace()→TraceHandle(events/tool_calls/token_usage) and.logs()→LogHandle(list_files/read_text/tail), bothasyncand per-trial cached so sibling metrics share a materialized handle.WellKnownEvidenceKeyLiteral (initial_state/trace/logs/final_state/verifier_logs).examples/run_agent_eval/example_metrics.py, example-only):inefficient_retry_loopscoring offTraceHandle.make vendor.Test plan
ruff check+ruff format --checkclean on changed filesty checkcleanpytest packages/nemo_evaluator_sdk/tests/agent_eval/→ 68 passed (NAT/OTel/OpenInference normalization, inline + file-ref persist-time normalization round-trips, trace/log handle access, example metric)make vendorregenerated mirror, no driftNotes
values/evidence.py(no new value module) per review preference.*.atif.json) and is a no-op for already-ATIF or missing-file descriptors.Summary by CodeRabbit
InefficientRetryLoopMetricto detect repeated tool-call loops and report the maximum repeats.