feat(evaluator): Evidence handles, ATIF traces and trace normalisation by arpitsardhana · Pull Request #432 · NVIDIA-NeMo/nemo-platform

arpitsardhana · 2026-06-24T06:24:09Z

Summary

Adds the portable trace layer for agent-eval evidence so metrics can score off what the agent actually did (the trace and logs), independent of the producing runtime. Rebased onto main after the filesystem handle / run_verifier / taskset work landed via #447, so this PR is now scoped to traces and logs only.

ATIF data model (values/evidence.py): AtifTrace / AtifEvent / AtifEventType / AtifTokenUsage — a single normalized, persisted trace format with token accounting.
Source normalizers: normalize_trace dispatches by descriptor format/payload shape to normalize_nat_trajectory, normalize_otel_spans, and normalize_openinference_spans, so NAT trajectories and OTel/OpenInference spans all map to ATIF.
Persist-time normalization: normalize_trace_descriptor (idempotent, preserve-modality) and normalize_candidate_evidence run before a trial is persisted — wired into the live inline path (evaluator._trial_from_sample) and the file path (trials.standard_evidence_descriptors). Inline data is normalized in place; a file ref is normalized into a sibling *.atif.json. Result: persisted trace evidence is ATIF regardless of source.
Read handles: CandidateEvidence.trace() → TraceHandle (events / tool_calls / token_usage) and .logs() → LogHandle (list_files / read_text / tail), both async and per-trial cached so sibling metrics share a materialized handle.
Well-known keys: WellKnownEvidenceKey Literal (initial_state / trace / logs / final_state / verifier_logs).
Example reference metric (examples/run_agent_eval/example_metrics.py, example-only): inefficient_retry_loop scoring off TraceHandle.
Vendored SDK mirror synced via make vendor.

Test plan

ruff check + ruff format --check clean on changed files
ty check clean
pytest packages/nemo_evaluator_sdk/tests/agent_eval/ → 68 passed (NAT/OTel/OpenInference normalization, inline + file-ref persist-time normalization round-trips, trace/log handle access, example metric)
make vendor regenerated mirror, no drift
CI green

Notes

All additions live in values/evidence.py (no new value module) per review preference.
Persist-time normalization preserves storage modality (inline stays inline; file refs get a sibling *.atif.json) and is a no-op for already-ATIF or missing-file descriptors.

Summary by CodeRabbit

New Features
- Added normalized trace and log evidence support with cached access to trace steps, tool calls, and token usage.
- Introduced InefficientRetryLoopMetric to detect repeated tool-call loops and report the maximum repeats.
Bug Fixes
- Standardized trace evidence descriptors by normalizing them to ATIF format (including file-backed traces).
- Ensured evidence imports and trial construction use normalized trace descriptors consistently.
- Improved verifier timeout cleanup by validating the full process tree is terminated.
Tests
- Added/expanded tests for trace normalization round-trips, trace/log handle behavior, and the retry-loop metric.

github-actions · 2026-06-24T06:38:39Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	21624/28290	76.4%	61.3%
Integration Tests	12490/27059	46.2%	19.3%

arpitsardhana · 2026-06-25T18:30:10Z

+        """Return the normalized trace, reading and parsing on first access."""
+        if self._trace is None:
+            payload = await asyncio.to_thread(self._load_payload)
+            self._trace = normalize_trace(payload, source_format=self._descriptor.format)


fabric?

Trace handle need to understand source format

arpitsardhana · 2026-06-25T18:33:37Z

+
+
+# (kind_attr, tool_kinds, tool_name_attr, prompt_attr, completion_attr)
+_OTEL_KEYS = (


Add codex and Claude format

See jsonl based multi -turn conversations

coderabbitai · 2026-06-26T07:11:14Z

Caution

Review failed

An error occurred during the review process. Please try again later.

📝 Walkthrough

Walkthrough

Trace evidence is normalized to ATIF during trial creation and descriptor generation. Evidence handles expose normalized traces and logs. Example metrics now use evidence constants and add a retry-loop metric. Tests cover normalization, log access, timeout cleanup, and the new metric.

Changes

ATIF trace evidence and retry metrics

Layer / File(s)	Summary
ATIF data models `packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/atif.py`	ATIF schema models, content validation, subagent references, tool calls, observations, metrics, and agent fields are added.
Trajectory validation and helpers `packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/atif.py`	Trajectory validation, JSON export, sequential step checks, subagent id checks, tool-call reference checks, and multimodal detection are added.
Trace normalization and evidence handles `packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py`, `packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/__init__.py`	Trace payload normalization, descriptor rewriting, candidate normalization, trace and log handles, cached accessors, and public evidence exports are added.
Trace descriptor wiring `packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/evaluator.py`, `packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py`	Trial trace evidence and standard evidence descriptors are normalized before storage.
Example metrics `packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py`	Example metrics switch to evidence constants and add the trace-based retry-loop metric.
Tests `packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py`, `packages/nemo_evaluator_sdk/tests/agent_eval/test_example_metrics.py`	Tests cover trace normalization, log access, verifier timeout handling, and the retry-loop metric.

Possibly related PRs

NVIDIA-NeMo/nemo-platform#339: Extends the same CandidateEvidence/EvidenceDescriptor domain with trace normalization and new evidence handle accessors.
NVIDIA-NeMo/nemo-platform#383: Touches the same trial/evidence construction path in agent_eval/trials.py.
NVIDIA-NeMo/nemo-platform#447: Updates examples/run_agent_eval/example_metrics.py with evidence-name wiring and a new metric.

Suggested reviewers

SandyChapman
ngoncharenko

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 53.57% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: evidence handles, ATIF traces, and trace normalization.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch aalgo-258-p0-evidence-metrics/arpsingh

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 OpenGrep (1.23.0)

packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.49][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

packages/nemo_evaluator_sdk/tests/agent_eval/test_example_metrics.py

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.15][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.13][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

1 others

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 5

♻️ Duplicate comments (1)

packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py (1)
161-168: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

This strips source format for file-based traces.

standard_evidence_descriptors() can now only emit atif or json, so OTEL/OpenInference exports passed via trace_path can never hit their normalizers on this path. This needs an explicit trace-format input instead of filename heuristics.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py`
around lines 161 - 168, The trace descriptor normalization is using filename
heuristics in standard_evidence_descriptors() / the trace_path handling, which
causes file-based OTEL/OpenInference traces to lose their original format.
Update the evidence-building flow to accept an explicit trace-format input and
pass that through when constructing the EvidenceDescriptor for EVIDENCE_TRACE,
instead of inferring “atif” vs “json” from is_atif or the path name. Ensure
normalize_trace_descriptor and the caller in trials.py preserve the source
format so the correct normalizer can run.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py`:
- Around line 19-21: The retry detection in the example metrics logic is
counting global frequency instead of consecutive repeats, so update the relevant
code in example_metrics.py to measure the longest streak of identical calls in
the trace rather than total occurrences. In the call-counting path around the
metrics computation (the logic that uses EVIDENCE_TRACE), canonicalize nested
args before comparing so semantically equivalent dicts with different insertion
order map to the same payload, and keep the streak bounded to adjacent entries
only. Use the existing symbols EVIDENCE_TRACE, EVIDENCE_INITIAL_STATE, and
EVIDENCE_FINAL_STATE to locate the trace-processing code and adjust the retry
metric accordingly.

In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py`:
- Around line 490-510: normalize_trace() currently drops any list-shaped payload
unless source_format is explicitly otel/openinference, which can erase live
evaluator trajectory evidence. Update normalize_trace() in evidence.py to
preserve unknown list traces by either routing them through the appropriate
list-based normalizer or explicitly validating and rejecting unsupported list
shapes instead of returning an empty AtifTrace(). Keep the existing dict
handling for steps/events and ensure the fallback path for bare lists is handled
safely rather than silently discarding data.
- Around line 534-541: The ref-handling in `_local_filesystem_ref()` / the
`descriptor.ref is not None` branch currently raises for remote refs before the
`is_file()` no-op path can run, which conflicts with the documented behavior.
Update the logic so non-local or unresolvable refs are detected and returned
unchanged in `descriptor.model_copy` flow, and only call
`_local_filesystem_ref()` after confirming the ref is local or otherwise safe to
resolve. Keep the existing `normalize_trace` and `.atif.json` write path for
local files, but preserve the unchanged descriptor behavior for remote refs.
- Around line 671-682: The `Evidence.trace()` accessor is currently using
`self.require(name)` without verifying the descriptor kind, so a non-trace entry
can be wrapped as a `TraceHandle` and look valid. Update `trace()` to enforce
`kind="trace"` before constructing the handle, matching the stricter contract
already implied by `TraceHandle` and the existing `filesystem()`/`logs()`
accessors. Keep the cache behavior in `_trace_cache`, but ensure only true trace
descriptors are accepted and anything else fails immediately.

In `@packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py`:
- Around line 5-7: The process cleanup assertion in test_evidence.py is using a
fixed 200 ms wait, which can make the test flaky if the child is briefly a
zombie on a busy runner. Update the test around the cleanup check to poll
`os.kill(pid, 0)` until a short deadline instead of sleeping once, so the
assertion waits for reaping to complete. Use the existing process-tree
cleanup/test logic in `test_evidence` to locate the affected check and keep the
polling window small.

---

Duplicate comments:
In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py`:
- Around line 161-168: The trace descriptor normalization is using filename
heuristics in standard_evidence_descriptors() / the trace_path handling, which
causes file-based OTEL/OpenInference traces to lose their original format.
Update the evidence-building flow to accept an explicit trace-format input and
pass that through when constructing the EvidenceDescriptor for EVIDENCE_TRACE,
instead of inferring “atif” vs “json” from is_atif or the path name. Ensure
normalize_trace_descriptor and the caller in trials.py preserve the source
format so the correct normalizer can run.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 739e49cd-e370-4b40-991d-c3b8b5189fc5

📥 Commits

Reviewing files that changed from the base of the PR and between e1ac31b and 9c8325f.

⛔ Files ignored due to path filters (4)

sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/agent_eval/evaluator.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/agent_eval/trials.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/values/__init__.py is excluded by !sdk/**
sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/values/evidence.py is excluded by !sdk/**

📒 Files selected for processing (7)

packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/evaluator.py
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/__init__.py
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py
packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py
packages/nemo_evaluator_sdk/tests/agent_eval/test_example_metrics.py

coderabbitai · 2026-06-26T07:11:17Z

 from collections.abc import Sequence

+from nemo_evaluator_sdk.agent_eval.trials import EVIDENCE_FINAL_STATE, EVIDENCE_INITIAL_STATE, EVIDENCE_TRACE


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Count retry streaks, not global frequency.

Lines 118-121 count identical calls across the entire trace. That flags legitimate reuse separated by other work as a retry loop, and the current key still splits semantically identical nested args when dict insertion order differs. Track the longest consecutive run from a canonicalized call payload instead.

Suggested fix

+import json from collections.abc import Sequence @@ async def compute_scores(self, input: MetricInput) -> MetricResult: max_repeats = 0 evidence = input.candidate.evidence if evidence is not None and evidence.get(self._evidence_name) is not None: calls = await (await evidence.trace(self._evidence_name)).tool_calls() - counts: dict[str, int] = {} + previous_key: str | None = None + current_repeats = 0 for call in calls: - key = f"{call.name}:{sorted((call.arguments or {}).items())}" - counts[key] = counts.get(key, 0) + 1 - max_repeats = max(counts.values(), default=0) + key = json.dumps( + {"name": call.name, "arguments": call.arguments or {}}, + sort_keys=True, + separators=(",", ":"), + ) + current_repeats = current_repeats + 1 if key == previous_key else 1 + previous_key = key + max_repeats = max(max_repeats, current_repeats) return MetricResult( outputs=[ MetricOutput(name="efficient_tool_use", value=max_repeats <= self._threshold), MetricOutput(name="max_repeated_tool_calls", value=max_repeats), ]

Also applies to: 112-125

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py` around lines 19 - 21, The retry detection in the example metrics logic is counting global frequency instead of consecutive repeats, so update the relevant code in example_metrics.py to measure the longest streak of identical calls in the trace rather than total occurrences. In the call-counting path around the metrics computation (the logic that uses EVIDENCE_TRACE), canonicalize nested args before comparing so semantically equivalent dicts with different insertion order map to the same payload, and keep the streak bounded to adjacent entries only. Use the existing symbols EVIDENCE_TRACE, EVIDENCE_INITIAL_STATE, and EVIDENCE_FINAL_STATE to locate the trace-processing code and adjust the retry metric accordingly.

coderabbitai · 2026-06-26T07:11:17Z

+def normalize_trace(payload: Any, *, source_format: str | None = None) -> AtifTrace:
+    """Normalize a raw trace payload into an :class:`AtifTrace`.
+
+    ``source_format`` (from the evidence descriptor) routes to a span normalizer
+    when set to ``otel``/``openinference``; otherwise the payload shape decides:
+    a NAT trajectory (``steps``), an already-normalized trace (``events``), or a
+    bare span list.
+    """
+    if isinstance(payload, AtifTrace):
+        return payload
+    fmt = (source_format or "").lower()
+    if fmt in {"otel", "opentelemetry"} and isinstance(payload, list):
+        return normalize_otel_spans(payload)
+    if fmt == "openinference" and isinstance(payload, list):
+        return normalize_openinference_spans(payload)
+    if isinstance(payload, dict):
+        if "steps" in payload:
+            return normalize_nat_trajectory(payload)
+        if "events" in payload:
+            return AtifTrace.model_validate(payload)
+    return AtifTrace()


🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Unknown list traces are dropped here.

normalize_trace() returns AtifTrace() for any list unless source_format is otel/openinference. The live evaluator path still feeds list-shaped trajectory payloads, so this change can erase trace evidence instead of normalizing or rejecting it.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py` around lines 490 - 510, normalize_trace() currently drops any list-shaped payload unless source_format is explicitly otel/openinference, which can erase live evaluator trajectory evidence. Update normalize_trace() in evidence.py to preserve unknown list traces by either routing them through the appropriate list-based normalizer or explicitly validating and rejecting unsupported list shapes instead of returning an empty AtifTrace(). Keep the existing dict handling for steps/events and ensure the fallback path for bare lists is handled safely rather than silently discarding data.

coderabbitai · 2026-06-26T07:11:17Z

+    if descriptor.ref is not None:
+        source = _local_filesystem_ref(descriptor.ref)
+        if not source.is_file():
+            return descriptor
+        trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format)
+        atif_path = source.with_name(f"{source.stem}.atif.json")
+        atif_path.write_text(trace.model_dump_json(), encoding="utf-8")
+        return descriptor.model_copy(update={"format": "atif", "ref": str(atif_path), "data": None})


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Remote trace refs now throw instead of no-oping.

The docstring says unresolvable refs are returned unchanged, but _local_filesystem_ref() raises on non-local refs before the is_file() guard runs.

Proposed fix

if descriptor.ref is not None: - source = _local_filesystem_ref(descriptor.ref) + try: + source = _local_filesystem_ref(descriptor.ref) + except ValueError: + return descriptor if not source.is_file(): return descriptor trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if descriptor.ref is not None:

source = _local_filesystem_ref(descriptor.ref)

if not source.is_file():

return descriptor

trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format)

atif_path = source.with_name(f"{source.stem}.atif.json")

atif_path.write_text(trace.model_dump_json(), encoding="utf-8")

return descriptor.model_copy(update={"format": "atif", "ref": str(atif_path), "data": None})

if descriptor.ref is not None:

try:

source = _local_filesystem_ref(descriptor.ref)

except ValueError:

return descriptor

if not source.is_file():

return descriptor

trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format)

atif_path = source.with_name(f"{source.stem}.atif.json")

atif_path.write_text(trace.model_dump_json(), encoding="utf-8")

return descriptor.model_copy(update={"format": "atif", "ref": str(atif_path), "data": None})

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py` around lines 534 - 541, The ref-handling in `_local_filesystem_ref()` / the `descriptor.ref is not None` branch currently raises for remote refs before the `is_file()` no-op path can run, which conflicts with the documented behavior. Update the logic so non-local or unresolvable refs are detected and returned unchanged in `descriptor.model_copy` flow, and only call `_local_filesystem_ref()` after confirming the ref is local or otherwise safe to resolve. Keep the existing `normalize_trace` and `.atif.json` write path for local files, but preserve the unchanged descriptor behavior for remote refs.

coderabbitai · 2026-06-26T07:11:17Z

+    async def trace(self, name: str = "trace") -> TraceHandle:
+        """Return a cached normalized-trace handle for a named trace descriptor.
+
+        Async for consistency with :meth:`filesystem`/:meth:`logs` and to leave room
+        for ref staging/validation; the trace itself is read lazily on first access.
+        """
+        cached = self._trace_cache.get(name)
+        if cached is not None:
+            return cached
+        handle = TraceHandle(self.require(name))
+        self._trace_cache[name] = handle
+        return handle


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Enforce kind="trace" here.

Unlike filesystem() and logs(), this accessor accepts any descriptor kind. That currently turns a non-trace "trace" entry into an empty trace instead of surfacing the contract break.

Proposed fix

- handle = TraceHandle(self.require(name)) + handle = TraceHandle(self.require(name, kind="trace"))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

async def trace(self, name: str = "trace") -> TraceHandle:

"""Return a cached normalized-trace handle for a named trace descriptor.

Async for consistency with :meth:`filesystem`/:meth:`logs` and to leave room

for ref staging/validation; the trace itself is read lazily on first access.

"""

cached = self._trace_cache.get(name)

if cached is not None:

return cached

handle = TraceHandle(self.require(name))

self._trace_cache[name] = handle

return handle

async def trace(self, name: str = "trace") -> TraceHandle:

"""Return a cached normalized-trace handle for a named trace descriptor.

Async for consistency with :meth:`filesystem`/:meth:`logs` and to leave room

for ref staging/validation; the trace itself is read lazily on first access.

"""

cached = self._trace_cache.get(name)

if cached is not None:

return cached

handle = TraceHandle(self.require(name, kind="trace"))

self._trace_cache[name] = handle

return handle

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py` around lines 671 - 682, The `Evidence.trace()` accessor is currently using `self.require(name)` without verifying the descriptor kind, so a non-trace entry can be wrapped as a `TraceHandle` and look valid. Update `trace()` to enforce `kind="trace"` before constructing the handle, matching the stricter contract already implied by `TraceHandle` and the existing `filesystem()`/`logs()` accessors. Keep the cache behavior in `_trace_cache`, but ensure only true trace descriptors are accepted and anything else fails immediately.

coderabbitai · 2026-06-26T07:11:17Z

+import json
+import os
 from pathlib import Path


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Avoid the fixed 200 ms reap window.

Line 168 assumes the killed child will be fully reaped within 200 ms. On a busy runner, os.kill(pid, 0) can still see a zombie briefly and make this test flaky even when the process tree cleanup is correct. Poll to a short deadline instead.

Suggested fix

import json import os +import time from pathlib import Path @@ child_pid = int(pidfile.read_text().strip()) - await asyncio.sleep(0.2) # let the OS finish reaping the killed group - with pytest.raises(ProcessLookupError): - os.kill(child_pid, 0) + deadline = time.monotonic() + 2 + while True: + try: + os.kill(child_pid, 0) + except ProcessLookupError: + break + if time.monotonic() >= deadline: + pytest.fail(f"child process {child_pid} still exists after timeout cleanup") + await asyncio.sleep(0.05)

Also applies to: 167-170

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py` around lines 5 - 7, The process cleanup assertion in test_evidence.py is using a fixed 200 ms wait, which can make the test flaky if the child is briefly a zombie on a busy runner. Update the test around the cleanup check to poll `os.kill(pid, 0)` until a short deadline instead of sleeping once, so the assertion waits for reaping to complete. Use the existing process-tree cleanup/test logic in `test_evidence` to locate the affected check and keep the polling window small.

arpitsardhana · 2026-06-26T07:22:44Z

+    metadata: dict[str, Any] = Field(default_factory=dict)
+
+
+def normalize_nat_trajectory(payload: dict[str, Any]) -> AtifTrace:


this normalisation is not accurate. need to be fixed

Add the ATIF (Agent Trace Interchange Format) model and normalizers for NAT trajectories, OpenTelemetry GenAI spans, and OpenInference spans, plus the TraceHandle and LogHandle read handles exposed via CandidateEvidence.trace() and CandidateEvidence.logs(). Includes the example-only inefficient_retry_loop metric that scores over the normalized trace evidence. Vendored mirror synced via make vendor. Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>

SandyChapman · 2026-06-26T12:32:33Z

+        initial_name: str = EVIDENCE_INITIAL_STATE,
+        final_name: str = EVIDENCE_FINAL_STATE,


Question, do you think these need to be injected here? When would the initial/final states have a different name?

SandyChapman · 2026-06-26T12:34:13Z

+        if evidence is not None and evidence.get(self._evidence_name) is not None:
+            calls = await (await evidence.trace(self._evidence_name)).tool_calls()


Why would the user choose to get different evidence types in the metric? It seems like this always relies on it being EVIDENCE_TRACE, right?

SandyChapman · 2026-06-26T12:43:14Z

+        """Return the normalized ATIF trajectory, reading and parsing on first access."""
+        if self._trajectory is None:
+            payload = await asyncio.to_thread(self._load_payload)
+            self._trajectory = normalize_trace(payload, source_format=self._descriptor.format)


I don't think I quite understand the process in the PR. We end up normalizing the EvidenceDescriptor which converts the payload to atif, but we also call it here to normalize the trajectory. Do we need both? Or am I reading this incorrectly. I'd assume the normalizing happening here is sufficient and we don't need to pre-emptively normalize the evidence descriptor.

SandyChapman · 2026-06-26T12:44:51Z

-        trace = EvidenceDescriptor(kind="trace", format="json", data=sample["trajectory"])
+        # Normalize to ATIF before the trial is persisted so the stored shape is
+        # source-agnostic (sources in, ATIF out); TraceHandle then reads it uniformly.
+        trace = normalize_trace_descriptor(EvidenceDescriptor(kind="trace", format="json", data=sample["trajectory"]))


I don't think this is the right approach tbh. I think persisting the raw trace and convert on metric usage is the right approach. The problem with converting before persisting is that if there's a bug in our conversion code or we don't handle the source format fully, the conversion may be lossy. At least if we convert on read, users can improve the conversion function without loss.

github-actions Bot added the feat label Jun 24, 2026

arpitsardhana changed the title ~~feat(evaluator): P0 evidence handles, ATIF traces, and taskset~~ feat(evaluator): filesystem extension, Evidence handles, ATIF traces, and taskset Jun 24, 2026

crookedstorm force-pushed the main branch from bfb0e6b to ed04d87 Compare June 24, 2026 18:07

arpitsardhana force-pushed the aalgo-258-p0-evidence-metrics/arpsingh branch 3 times, most recently from 55ce862 to 15882d4 Compare June 24, 2026 22:04

arpitsardhana commented Jun 25, 2026

View reviewed changes

arpitsardhana force-pushed the aalgo-258-p0-evidence-metrics/arpsingh branch 2 times, most recently from 948dc8a to 9c8325f Compare June 26, 2026 06:55

arpitsardhana changed the title ~~feat(evaluator): filesystem extension, Evidence handles, ATIF traces, and taskset~~ feat(evaluator): ATIF trace normalization and trace/log evidence read handles Jun 26, 2026

arpitsardhana changed the title ~~feat(evaluator): ATIF trace normalization and trace/log evidence read handles~~ feat(evaluator): Evidence handles, ATIF traces and trace normalisation Jun 26, 2026

arpitsardhana marked this pull request as ready for review June 26, 2026 07:00

arpitsardhana requested review from a team as code owners June 26, 2026 07:00

arpitsardhana requested review from SandyChapman and ngoncharenko June 26, 2026 07:00

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

arpitsardhana commented Jun 26, 2026

View reviewed changes

arpitsardhana force-pushed the aalgo-258-p0-evidence-metrics/arpsingh branch from 9c8325f to 2da7a67 Compare June 26, 2026 07:52

SandyChapman reviewed Jun 26, 2026

View reviewed changes



		# (kind_attr, tool_kinds, tool_name_attr, prompt_attr, completion_attr)
		_OTEL_KEYS = (

		from collections.abc import Sequence

		from nemo_evaluator_sdk.agent_eval.trials import EVIDENCE_FINAL_STATE, EVIDENCE_INITIAL_STATE, EVIDENCE_TRACE

		metadata: dict[str, Any] = Field(default_factory=dict)


		def normalize_nat_trajectory(payload: dict[str, Any]) -> AtifTrace:

		initial_name: str = EVIDENCE_INITIAL_STATE,
		final_name: str = EVIDENCE_FINAL_STATE,

		if evidence is not None and evidence.get(self._evidence_name) is not None:
		calls = await (await evidence.trace(self._evidence_name)).tool_calls()

Uh oh!

Conversation

arpitsardhana commented Jun 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Notes

Summary by CodeRabbit

Uh oh!

github-actions Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arpitsardhana Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

arpitsardhana Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

arpitsardhana Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

SandyChapman Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

SandyChapman Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

SandyChapman Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

SandyChapman Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

arpitsardhana commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented Jun 24, 2026 •

edited

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading