Skip to content

feat(evaluator): Evidence handles, ATIF traces and trace normalisation#432

Open
arpitsardhana wants to merge 1 commit into
mainfrom
aalgo-258-p0-evidence-metrics/arpsingh
Open

feat(evaluator): Evidence handles, ATIF traces and trace normalisation#432
arpitsardhana wants to merge 1 commit into
mainfrom
aalgo-258-p0-evidence-metrics/arpsingh

Conversation

@arpitsardhana

@arpitsardhana arpitsardhana commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds the portable trace layer for agent-eval evidence so metrics can score off what the agent actually did (the trace and logs), independent of the producing runtime. Rebased onto main after the filesystem handle / run_verifier / taskset work landed via #447, so this PR is now scoped to traces and logs only.

  • ATIF data model (values/evidence.py): AtifTrace / AtifEvent / AtifEventType / AtifTokenUsage — a single normalized, persisted trace format with token accounting.
  • Source normalizers: normalize_trace dispatches by descriptor format/payload shape to normalize_nat_trajectory, normalize_otel_spans, and normalize_openinference_spans, so NAT trajectories and OTel/OpenInference spans all map to ATIF.
  • Persist-time normalization: normalize_trace_descriptor (idempotent, preserve-modality) and normalize_candidate_evidence run before a trial is persisted — wired into the live inline path (evaluator._trial_from_sample) and the file path (trials.standard_evidence_descriptors). Inline data is normalized in place; a file ref is normalized into a sibling *.atif.json. Result: persisted trace evidence is ATIF regardless of source.
  • Read handles: CandidateEvidence.trace()TraceHandle (events / tool_calls / token_usage) and .logs()LogHandle (list_files / read_text / tail), both async and per-trial cached so sibling metrics share a materialized handle.
  • Well-known keys: WellKnownEvidenceKey Literal (initial_state / trace / logs / final_state / verifier_logs).
  • Example reference metric (examples/run_agent_eval/example_metrics.py, example-only): inefficient_retry_loop scoring off TraceHandle.
  • Vendored SDK mirror synced via make vendor.

Test plan

  • ruff check + ruff format --check clean on changed files
  • ty check clean
  • pytest packages/nemo_evaluator_sdk/tests/agent_eval/ → 68 passed (NAT/OTel/OpenInference normalization, inline + file-ref persist-time normalization round-trips, trace/log handle access, example metric)
  • make vendor regenerated mirror, no drift
  • CI green

Notes

  • All additions live in values/evidence.py (no new value module) per review preference.
  • Persist-time normalization preserves storage modality (inline stays inline; file refs get a sibling *.atif.json) and is a no-op for already-ATIF or missing-file descriptors.

Summary by CodeRabbit

  • New Features
    • Added normalized trace and log evidence support with cached access to trace steps, tool calls, and token usage.
    • Introduced InefficientRetryLoopMetric to detect repeated tool-call loops and report the maximum repeats.
  • Bug Fixes
    • Standardized trace evidence descriptors by normalizing them to ATIF format (including file-backed traces).
    • Ensured evidence imports and trial construction use normalized trace descriptors consistently.
    • Improved verifier timeout cleanup by validating the full process tree is terminated.
  • Tests
    • Added/expanded tests for trace normalization round-trips, trace/log handle behavior, and the retry-loop metric.

@github-actions github-actions Bot added the feat label Jun 24, 2026
@arpitsardhana arpitsardhana changed the title feat(evaluator): P0 evidence handles, ATIF traces, and taskset feat(evaluator): filesystem extension, Evidence handles, ATIF traces, and taskset Jun 24, 2026
@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor
Suite Lines Covered Line Rate Branch Rate
Unit Tests 21624/28290 76.4% 61.3%
Integration Tests 12490/27059 46.2% 19.3%

@arpitsardhana arpitsardhana force-pushed the aalgo-258-p0-evidence-metrics/arpsingh branch 3 times, most recently from 55ce862 to 15882d4 Compare June 24, 2026 22:04
"""Return the normalized trace, reading and parsing on first access."""
if self._trace is None:
payload = await asyncio.to_thread(self._load_payload)
self._trace = normalize_trace(payload, source_format=self._descriptor.format)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fabric?

Trace handle need to understand source format



# (kind_attr, tool_kinds, tool_name_attr, prompt_attr, completion_attr)
_OTEL_KEYS = (

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add codex and Claude format

See jsonl based multi -turn conversations

@arpitsardhana arpitsardhana force-pushed the aalgo-258-p0-evidence-metrics/arpsingh branch 2 times, most recently from 948dc8a to 9c8325f Compare June 26, 2026 06:55
@arpitsardhana arpitsardhana changed the title feat(evaluator): filesystem extension, Evidence handles, ATIF traces, and taskset feat(evaluator): ATIF trace normalization and trace/log evidence read handles Jun 26, 2026
@arpitsardhana arpitsardhana changed the title feat(evaluator): ATIF trace normalization and trace/log evidence read handles feat(evaluator): Evidence handles, ATIF traces and trace normalisation Jun 26, 2026
@arpitsardhana arpitsardhana marked this pull request as ready for review June 26, 2026 07:00
@arpitsardhana arpitsardhana requested review from a team as code owners June 26, 2026 07:00
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

An error occurred during the review process. Please try again later.

📝 Walkthrough

Walkthrough

Trace evidence is normalized to ATIF during trial creation and descriptor generation. Evidence handles expose normalized traces and logs. Example metrics now use evidence constants and add a retry-loop metric. Tests cover normalization, log access, timeout cleanup, and the new metric.

Changes

ATIF trace evidence and retry metrics

Layer / File(s) Summary
ATIF data models
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/atif.py
ATIF schema models, content validation, subagent references, tool calls, observations, metrics, and agent fields are added.
Trajectory validation and helpers
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/atif.py
Trajectory validation, JSON export, sequential step checks, subagent id checks, tool-call reference checks, and multimodal detection are added.
Trace normalization and evidence handles
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py, packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/__init__.py
Trace payload normalization, descriptor rewriting, candidate normalization, trace and log handles, cached accessors, and public evidence exports are added.
Trace descriptor wiring
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/evaluator.py, packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py
Trial trace evidence and standard evidence descriptors are normalized before storage.
Example metrics
packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py
Example metrics switch to evidence constants and add the trace-based retry-loop metric.
Tests
packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py, packages/nemo_evaluator_sdk/tests/agent_eval/test_example_metrics.py
Tests cover trace normalization, log access, verifier timeout handling, and the retry-loop metric.

Possibly related PRs

Suggested reviewers

  • SandyChapman
  • ngoncharenko
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: evidence handles, ATIF traces, and trace normalization.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch aalgo-258-p0-evidence-metrics/arpsingh

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 OpenGrep (1.23.0)
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.49][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

packages/nemo_evaluator_sdk/tests/agent_eval/test_example_metrics.py

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.15][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.13][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

  • 1 others

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (1)
packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py (1)

161-168: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

This strips source format for file-based traces.

standard_evidence_descriptors() can now only emit atif or json, so OTEL/OpenInference exports passed via trace_path can never hit their normalizers on this path. This needs an explicit trace-format input instead of filename heuristics.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py`
around lines 161 - 168, The trace descriptor normalization is using filename
heuristics in standard_evidence_descriptors() / the trace_path handling, which
causes file-based OTEL/OpenInference traces to lose their original format.
Update the evidence-building flow to accept an explicit trace-format input and
pass that through when constructing the EvidenceDescriptor for EVIDENCE_TRACE,
instead of inferring “atif” vs “json” from is_atif or the path name. Ensure
normalize_trace_descriptor and the caller in trials.py preserve the source
format so the correct normalizer can run.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py`:
- Around line 19-21: The retry detection in the example metrics logic is
counting global frequency instead of consecutive repeats, so update the relevant
code in example_metrics.py to measure the longest streak of identical calls in
the trace rather than total occurrences. In the call-counting path around the
metrics computation (the logic that uses EVIDENCE_TRACE), canonicalize nested
args before comparing so semantically equivalent dicts with different insertion
order map to the same payload, and keep the streak bounded to adjacent entries
only. Use the existing symbols EVIDENCE_TRACE, EVIDENCE_INITIAL_STATE, and
EVIDENCE_FINAL_STATE to locate the trace-processing code and adjust the retry
metric accordingly.

In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py`:
- Around line 490-510: normalize_trace() currently drops any list-shaped payload
unless source_format is explicitly otel/openinference, which can erase live
evaluator trajectory evidence. Update normalize_trace() in evidence.py to
preserve unknown list traces by either routing them through the appropriate
list-based normalizer or explicitly validating and rejecting unsupported list
shapes instead of returning an empty AtifTrace(). Keep the existing dict
handling for steps/events and ensure the fallback path for bare lists is handled
safely rather than silently discarding data.
- Around line 534-541: The ref-handling in `_local_filesystem_ref()` / the
`descriptor.ref is not None` branch currently raises for remote refs before the
`is_file()` no-op path can run, which conflicts with the documented behavior.
Update the logic so non-local or unresolvable refs are detected and returned
unchanged in `descriptor.model_copy` flow, and only call
`_local_filesystem_ref()` after confirming the ref is local or otherwise safe to
resolve. Keep the existing `normalize_trace` and `.atif.json` write path for
local files, but preserve the unchanged descriptor behavior for remote refs.
- Around line 671-682: The `Evidence.trace()` accessor is currently using
`self.require(name)` without verifying the descriptor kind, so a non-trace entry
can be wrapped as a `TraceHandle` and look valid. Update `trace()` to enforce
`kind="trace"` before constructing the handle, matching the stricter contract
already implied by `TraceHandle` and the existing `filesystem()`/`logs()`
accessors. Keep the cache behavior in `_trace_cache`, but ensure only true trace
descriptors are accepted and anything else fails immediately.

In `@packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py`:
- Around line 5-7: The process cleanup assertion in test_evidence.py is using a
fixed 200 ms wait, which can make the test flaky if the child is briefly a
zombie on a busy runner. Update the test around the cleanup check to poll
`os.kill(pid, 0)` until a short deadline instead of sleeping once, so the
assertion waits for reaping to complete. Use the existing process-tree
cleanup/test logic in `test_evidence` to locate the affected check and keep the
polling window small.

---

Duplicate comments:
In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py`:
- Around line 161-168: The trace descriptor normalization is using filename
heuristics in standard_evidence_descriptors() / the trace_path handling, which
causes file-based OTEL/OpenInference traces to lose their original format.
Update the evidence-building flow to accept an explicit trace-format input and
pass that through when constructing the EvidenceDescriptor for EVIDENCE_TRACE,
instead of inferring “atif” vs “json” from is_atif or the path name. Ensure
normalize_trace_descriptor and the caller in trials.py preserve the source
format so the correct normalizer can run.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 739e49cd-e370-4b40-991d-c3b8b5189fc5

📥 Commits

Reviewing files that changed from the base of the PR and between e1ac31b and 9c8325f.

⛔ Files ignored due to path filters (4)
  • sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/agent_eval/evaluator.py is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/agent_eval/trials.py is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/values/__init__.py is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/beta/evaluator/values/evidence.py is excluded by !sdk/**
📒 Files selected for processing (7)
  • packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py
  • packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/evaluator.py
  • packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/agent_eval/trials.py
  • packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/__init__.py
  • packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py
  • packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py
  • packages/nemo_evaluator_sdk/tests/agent_eval/test_example_metrics.py

Comment on lines 19 to +21
from collections.abc import Sequence

from nemo_evaluator_sdk.agent_eval.trials import EVIDENCE_FINAL_STATE, EVIDENCE_INITIAL_STATE, EVIDENCE_TRACE

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Count retry streaks, not global frequency.

Lines 118-121 count identical calls across the entire trace. That flags legitimate reuse separated by other work as a retry loop, and the current key still splits semantically identical nested args when dict insertion order differs. Track the longest consecutive run from a canonicalized call payload instead.

Suggested fix
+import json
 from collections.abc import Sequence
@@
     async def compute_scores(self, input: MetricInput) -> MetricResult:
         max_repeats = 0
         evidence = input.candidate.evidence
         if evidence is not None and evidence.get(self._evidence_name) is not None:
             calls = await (await evidence.trace(self._evidence_name)).tool_calls()
-            counts: dict[str, int] = {}
+            previous_key: str | None = None
+            current_repeats = 0
             for call in calls:
-                key = f"{call.name}:{sorted((call.arguments or {}).items())}"
-                counts[key] = counts.get(key, 0) + 1
-            max_repeats = max(counts.values(), default=0)
+                key = json.dumps(
+                    {"name": call.name, "arguments": call.arguments or {}},
+                    sort_keys=True,
+                    separators=(",", ":"),
+                )
+                current_repeats = current_repeats + 1 if key == previous_key else 1
+                previous_key = key
+                max_repeats = max(max_repeats, current_repeats)
         return MetricResult(
             outputs=[
                 MetricOutput(name="efficient_tool_use", value=max_repeats <= self._threshold),
                 MetricOutput(name="max_repeated_tool_calls", value=max_repeats),
             ]

Also applies to: 112-125

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nemo_evaluator_sdk/examples/run_agent_eval/example_metrics.py`
around lines 19 - 21, The retry detection in the example metrics logic is
counting global frequency instead of consecutive repeats, so update the relevant
code in example_metrics.py to measure the longest streak of identical calls in
the trace rather than total occurrences. In the call-counting path around the
metrics computation (the logic that uses EVIDENCE_TRACE), canonicalize nested
args before comparing so semantically equivalent dicts with different insertion
order map to the same payload, and keep the streak bounded to adjacent entries
only. Use the existing symbols EVIDENCE_TRACE, EVIDENCE_INITIAL_STATE, and
EVIDENCE_FINAL_STATE to locate the trace-processing code and adjust the retry
metric accordingly.

Comment on lines +490 to +510
def normalize_trace(payload: Any, *, source_format: str | None = None) -> AtifTrace:
"""Normalize a raw trace payload into an :class:`AtifTrace`.

``source_format`` (from the evidence descriptor) routes to a span normalizer
when set to ``otel``/``openinference``; otherwise the payload shape decides:
a NAT trajectory (``steps``), an already-normalized trace (``events``), or a
bare span list.
"""
if isinstance(payload, AtifTrace):
return payload
fmt = (source_format or "").lower()
if fmt in {"otel", "opentelemetry"} and isinstance(payload, list):
return normalize_otel_spans(payload)
if fmt == "openinference" and isinstance(payload, list):
return normalize_openinference_spans(payload)
if isinstance(payload, dict):
if "steps" in payload:
return normalize_nat_trajectory(payload)
if "events" in payload:
return AtifTrace.model_validate(payload)
return AtifTrace()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Unknown list traces are dropped here.

normalize_trace() returns AtifTrace() for any list unless source_format is otel/openinference. The live evaluator path still feeds list-shaped trajectory payloads, so this change can erase trace evidence instead of normalizing or rejecting it.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py` around
lines 490 - 510, normalize_trace() currently drops any list-shaped payload
unless source_format is explicitly otel/openinference, which can erase live
evaluator trajectory evidence. Update normalize_trace() in evidence.py to
preserve unknown list traces by either routing them through the appropriate
list-based normalizer or explicitly validating and rejecting unsupported list
shapes instead of returning an empty AtifTrace(). Keep the existing dict
handling for steps/events and ensure the fallback path for bare lists is handled
safely rather than silently discarding data.

Comment on lines +534 to +541
if descriptor.ref is not None:
source = _local_filesystem_ref(descriptor.ref)
if not source.is_file():
return descriptor
trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format)
atif_path = source.with_name(f"{source.stem}.atif.json")
atif_path.write_text(trace.model_dump_json(), encoding="utf-8")
return descriptor.model_copy(update={"format": "atif", "ref": str(atif_path), "data": None})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Remote trace refs now throw instead of no-oping.

The docstring says unresolvable refs are returned unchanged, but _local_filesystem_ref() raises on non-local refs before the is_file() guard runs.

Proposed fix
     if descriptor.ref is not None:
-        source = _local_filesystem_ref(descriptor.ref)
+        try:
+            source = _local_filesystem_ref(descriptor.ref)
+        except ValueError:
+            return descriptor
         if not source.is_file():
             return descriptor
         trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if descriptor.ref is not None:
source = _local_filesystem_ref(descriptor.ref)
if not source.is_file():
return descriptor
trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format)
atif_path = source.with_name(f"{source.stem}.atif.json")
atif_path.write_text(trace.model_dump_json(), encoding="utf-8")
return descriptor.model_copy(update={"format": "atif", "ref": str(atif_path), "data": None})
if descriptor.ref is not None:
try:
source = _local_filesystem_ref(descriptor.ref)
except ValueError:
return descriptor
if not source.is_file():
return descriptor
trace = normalize_trace(_read_trace_payload(descriptor), source_format=descriptor.format)
atif_path = source.with_name(f"{source.stem}.atif.json")
atif_path.write_text(trace.model_dump_json(), encoding="utf-8")
return descriptor.model_copy(update={"format": "atif", "ref": str(atif_path), "data": None})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py` around
lines 534 - 541, The ref-handling in `_local_filesystem_ref()` / the
`descriptor.ref is not None` branch currently raises for remote refs before the
`is_file()` no-op path can run, which conflicts with the documented behavior.
Update the logic so non-local or unresolvable refs are detected and returned
unchanged in `descriptor.model_copy` flow, and only call
`_local_filesystem_ref()` after confirming the ref is local or otherwise safe to
resolve. Keep the existing `normalize_trace` and `.atif.json` write path for
local files, but preserve the unchanged descriptor behavior for remote refs.

Comment on lines +671 to +682
async def trace(self, name: str = "trace") -> TraceHandle:
"""Return a cached normalized-trace handle for a named trace descriptor.

Async for consistency with :meth:`filesystem`/:meth:`logs` and to leave room
for ref staging/validation; the trace itself is read lazily on first access.
"""
cached = self._trace_cache.get(name)
if cached is not None:
return cached
handle = TraceHandle(self.require(name))
self._trace_cache[name] = handle
return handle

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Enforce kind="trace" here.

Unlike filesystem() and logs(), this accessor accepts any descriptor kind. That currently turns a non-trace "trace" entry into an empty trace instead of surfacing the contract break.

Proposed fix
-        handle = TraceHandle(self.require(name))
+        handle = TraceHandle(self.require(name, kind="trace"))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def trace(self, name: str = "trace") -> TraceHandle:
"""Return a cached normalized-trace handle for a named trace descriptor.
Async for consistency with :meth:`filesystem`/:meth:`logs` and to leave room
for ref staging/validation; the trace itself is read lazily on first access.
"""
cached = self._trace_cache.get(name)
if cached is not None:
return cached
handle = TraceHandle(self.require(name))
self._trace_cache[name] = handle
return handle
async def trace(self, name: str = "trace") -> TraceHandle:
"""Return a cached normalized-trace handle for a named trace descriptor.
Async for consistency with :meth:`filesystem`/:meth:`logs` and to leave room
for ref staging/validation; the trace itself is read lazily on first access.
"""
cached = self._trace_cache.get(name)
if cached is not None:
return cached
handle = TraceHandle(self.require(name, kind="trace"))
self._trace_cache[name] = handle
return handle
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nemo_evaluator_sdk/src/nemo_evaluator_sdk/values/evidence.py` around
lines 671 - 682, The `Evidence.trace()` accessor is currently using
`self.require(name)` without verifying the descriptor kind, so a non-trace entry
can be wrapped as a `TraceHandle` and look valid. Update `trace()` to enforce
`kind="trace"` before constructing the handle, matching the stricter contract
already implied by `TraceHandle` and the existing `filesystem()`/`logs()`
accessors. Keep the cache behavior in `_trace_cache`, but ensure only true trace
descriptors are accepted and anything else fails immediately.

Comment on lines +5 to 7
import json
import os
from pathlib import Path

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Avoid the fixed 200 ms reap window.

Line 168 assumes the killed child will be fully reaped within 200 ms. On a busy runner, os.kill(pid, 0) can still see a zombie briefly and make this test flaky even when the process tree cleanup is correct. Poll to a short deadline instead.

Suggested fix
 import json
 import os
+import time
 from pathlib import Path
@@
     child_pid = int(pidfile.read_text().strip())
-    await asyncio.sleep(0.2)  # let the OS finish reaping the killed group
-    with pytest.raises(ProcessLookupError):
-        os.kill(child_pid, 0)
+    deadline = time.monotonic() + 2
+    while True:
+        try:
+            os.kill(child_pid, 0)
+        except ProcessLookupError:
+            break
+        if time.monotonic() >= deadline:
+            pytest.fail(f"child process {child_pid} still exists after timeout cleanup")
+        await asyncio.sleep(0.05)

Also applies to: 167-170

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nemo_evaluator_sdk/tests/agent_eval/test_evidence.py` around lines 5
- 7, The process cleanup assertion in test_evidence.py is using a fixed 200 ms
wait, which can make the test flaky if the child is briefly a zombie on a busy
runner. Update the test around the cleanup check to poll `os.kill(pid, 0)` until
a short deadline instead of sleeping once, so the assertion waits for reaping to
complete. Use the existing process-tree cleanup/test logic in `test_evidence` to
locate the affected check and keep the polling window small.

metadata: dict[str, Any] = Field(default_factory=dict)


def normalize_nat_trajectory(payload: dict[str, Any]) -> AtifTrace:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this normalisation is not accurate. need to be fixed

Add the ATIF (Agent Trace Interchange Format) model and normalizers for NAT
trajectories, OpenTelemetry GenAI spans, and OpenInference spans, plus the
TraceHandle and LogHandle read handles exposed via CandidateEvidence.trace()
and CandidateEvidence.logs().

Includes the example-only inefficient_retry_loop metric that scores over the
normalized trace evidence.

Vendored mirror synced via make vendor.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
@arpitsardhana arpitsardhana force-pushed the aalgo-258-p0-evidence-metrics/arpsingh branch from 9c8325f to 2da7a67 Compare June 26, 2026 07:52
Comment on lines +66 to +67
initial_name: str = EVIDENCE_INITIAL_STATE,
final_name: str = EVIDENCE_FINAL_STATE,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, do you think these need to be injected here? When would the initial/final states have a different name?

Comment on lines +115 to +116
if evidence is not None and evidence.get(self._evidence_name) is not None:
calls = await (await evidence.trace(self._evidence_name)).tool_calls()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the user choose to get different evidence types in the metric? It seems like this always relies on it being EVIDENCE_TRACE, right?

"""Return the normalized ATIF trajectory, reading and parsing on first access."""
if self._trajectory is None:
payload = await asyncio.to_thread(self._load_payload)
self._trajectory = normalize_trace(payload, source_format=self._descriptor.format)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I quite understand the process in the PR. We end up normalizing the EvidenceDescriptor which converts the payload to atif, but we also call it here to normalize the trajectory. Do we need both? Or am I reading this incorrectly. I'd assume the normalizing happening here is sufficient and we don't need to pre-emptively normalize the evidence descriptor.

trace = EvidenceDescriptor(kind="trace", format="json", data=sample["trajectory"])
# Normalize to ATIF before the trial is persisted so the stored shape is
# source-agnostic (sources in, ATIF out); TraceHandle then reads it uniformly.
trace = normalize_trace_descriptor(EvidenceDescriptor(kind="trace", format="json", data=sample["trajectory"]))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right approach tbh. I think persisting the raw trace and convert on metric usage is the right approach. The problem with converting before persisting is that if there's a bug in our conversion code or we don't handle the source format fully, the conversion may be lossy. At least if we convert on read, users can improve the conversion function without loss.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants