feat(evaluator): add Trial→Intake boundary mapping module (D8) #443

Open
SandyChapman wants to merge 1 commit into main from aalgo-289-intake-mapping-module/schapman

SandyChapman (Contributor) commented Jun 24, 2026

What

Adds plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py — the single pure layer (D8, AALGO-289) that translates Evaluator vocabulary into the platform SDK's typed Intake request params. D3/D4/D5 obtain their request shapes and field names only from here, so a later glossary rename is a one-file change.

Three pure functions:

  • trial_to_atif_ingest(trial, ...) -> AtifCreateParams: minimal single-step trajectory from trial.output until D2 trace normalization; defaults agent.version (design §3.9 #6).
  • score_to_evaluator_results(score, *, session_id, span_id) -> list[EvaluatorResultCreateParams]: one row per MetricOutput, name="{metric_type}.{output}". span_id is a caller-supplied parameter because it is server-assigned and only knowable after the ATIF ingest + span lookup. The orchestration (loop trials → POST atif → resolve span → coerce scores) belongs to the D3/D9 adapter, not this pure module.
  • run_task_to_experiment_context(trial, *, experiment_id) -> ExperimentContextParam: lean {experiment_id, test_case_id}.
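
As a rough illustration of the one-row-per-output naming rule, here is a minimal sketch. It does not use the real SDK types: `FakeScore`, `FakeOutput`, `ResultRow`, and `score_to_rows` are hypothetical stand-ins, and the field names on them are assumptions rather than the actual `AgentEvalTaskScore` / `EvaluatorResultCreateParams` shapes.

```python
from dataclasses import dataclass
from typing import TypedDict


@dataclass
class FakeOutput:
    """Hypothetical stand-in for a MetricOutput (field names assumed)."""
    name: str
    value: float


@dataclass
class FakeScore:
    """Hypothetical stand-in for an AgentEvalTaskScore (field names assumed)."""
    metric_type: str
    outputs: list[FakeOutput]


class ResultRow(TypedDict):
    """Hypothetical stand-in for EvaluatorResultCreateParams."""
    name: str
    session_id: str
    span_id: str
    value: float


def score_to_rows(score: FakeScore, *, session_id: str, span_id: str) -> list[ResultRow]:
    # One row per output, named "{metric_type}.{output}".
    # span_id is passed in by the caller; this function never looks it up.
    return [
        ResultRow(
            name=f"{score.metric_type}.{out.name}",
            session_id=session_id,
            span_id=span_id,
            value=out.value,
        )
        for out in score.outputs
    ]


rows = score_to_rows(
    FakeScore("accuracy", [FakeOutput("exact", 1.0), FakeOutput("f1", 0.5)]),
    session_id="session-1",
    span_id="span-1",
)
```

Because a TypedDict is a plain dict at runtime, each row could be splatted into a client call with `**row`, while a type checker still validates the keys statically.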

Design notes

  • Returns the generated nemo-platform-sdk *CreateParams TypedDicts, not hand-shaped dicts. At runtime they're plain dicts the adapter splats into client.intake.ingest.atif.create(**body); statically, ty checks our field names / literals / nested shapes against the real generated schema, so an API change surfaces as a type error instead of drifting silently. Imports the SDK client types (nemo_platform.types.intake.*, already a plugin dep), never the Intake service (nmp.intake.*).
  • CATEGORICAL coercion is deferred — a category and free text are indistinguishable at the value level today (both arrive as str/Label), so everything string-valued maps to TEXT until a real signal exists.
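
The deferred-CATEGORICAL rule can be sketched roughly as follows. This is not the module's `_coerce_metric_value`; `coerce_sketch` is a hypothetical helper, the `(data_type, value, string_value)` return shape is taken from the sequence diagram in the review, and encoding booleans as 1.0/0.0 is an assumption made only for illustration.

```python
from typing import Any, Optional


def coerce_sketch(value: Any) -> tuple[str, Optional[float], Optional[str]]:
    """Classify a metric value as BOOLEAN, NUMERIC, or TEXT (illustrative only)."""
    # Unwrap a pydantic-style RootModel wrapper if present (the ".root" unwrap).
    raw = getattr(value, "root", value)
    # bool must be checked before int/float: bool is a subclass of int in Python.
    if isinstance(raw, bool):
        return ("BOOLEAN", 1.0 if raw else 0.0, None)  # numeric encoding is assumed
    if isinstance(raw, (int, float)):
        return ("NUMERIC", float(raw), None)
    # Categories and free text are indistinguishable here, so both become TEXT.
    return ("TEXT", None, str(raw))
```

The ordering of the checks matters: testing for `int` first would silently classify `True` as NUMERIC.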

Tests

  • All data_type coercions in use + the MetricOutput.value.root unwrap.
  • trial_to_atif_ingest shape, version defaulting, missing-output, final_metrics.
  • score_to_evaluator_results naming, one-row-per-output, comment-from-diagnostic.
  • Import-hygiene guardrail: nothing under nemo_evaluator.intake imports the Intake service.
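
An import-hygiene guardrail of this kind could be written along the following lines. This is a sketch, not the contents of test_import_hygiene.py: `forbidden_imports` and `FORBIDDEN_PREFIXES` are hypothetical names, and the prefix list is an assumption based on the PR text (Intake service, HTTPX).

```python
import ast

# Assumed deny-list for illustration; the real test may forbid other modules.
FORBIDDEN_PREFIXES = ("nmp.intake", "httpx")


def forbidden_imports(source: str) -> list[str]:
    """Return the forbidden module names imported by a piece of Python source."""
    hits: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; treat it as no module name.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Match the prefix itself or any submodule of it, not lookalike names.
            if any(name == p or name.startswith(p + ".") for p in FORBIDDEN_PREFIXES):
                hits.append(name)
    return hits
```

A test could read every `.py` file under the intake package, pass its text to `forbidden_imports`, and fail if any list is non-empty. Parsing with `ast` rather than grepping avoids false positives from comments and strings.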

Verification

ruff (style + format), copyright headers, no-nmp_common-in-plugins guard, ty, and 21 unit tests all green.

Refs: AALGO-289. Informs D3/D4/D5/D9.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added support for sending evaluator runs and scores into Intake with richer metadata, including session tracking, experiment context, agent details, and final metrics.
    • Metric outputs are now converted more reliably across boolean, numeric, and text values.
  • Bug Fixes

    • Improved handling of missing outputs and optional fields so generated ingestion payloads are more consistent.

Adds plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py: the single
pure layer that translates Evaluator vocabulary (AgentEvalTrial,
AgentEvalTaskScore, MetricOutput) into the platform SDK's typed Intake request
params, so the D3/D4/D5 write-adapters share one source of request shapes.

- trial_to_atif_ingest -> AtifCreateParams (minimal single-step trajectory
  until D2 trace normalization; defaults agent.version per design §3.9 #6).
- score_to_evaluator_results -> list[EvaluatorResultCreateParams], one row per
  MetricOutput, name='{metric_type}.{output}', span_id supplied by the caller
  (resolved post-ingest; the adapter owns that orchestration).
- run_task_to_experiment_context -> ExperimentContextParam (lean
  {experiment_id, test_case_id}).

Returns the generated nemo-platform-sdk *CreateParams TypedDicts (runtime
dicts, statically checked against the real schema) rather than hand-shaped
dicts; imports the SDK client types, never the Intake service (nmp.intake.*).
CATEGORICAL coercion is intentionally deferred (strings -> TEXT) until a real
signal exists. Includes unit tests for all coercions + the .root unwrap and an
import-hygiene guardrail.

Refs: AALGO-289

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Sandy Chapman <schapman@nvidia.com>
@SandyChapman SandyChapman requested review from a team as code owners June 24, 2026 19:35
@github-actions github-actions Bot added the feat label Jun 24, 2026
coderabbitai Bot commented Jun 24, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1d8c744-4095-4b46-819e-3750fc02b81b

📥 Commits

Reviewing files that changed from the base of the PR and between d9e1851 and f937fa5.

📒 Files selected for processing (3)
  • plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py
  • plugins/nemo-evaluator/tests/intake/test_import_hygiene.py
  • plugins/nemo-evaluator/tests/intake/test_mapping.py

📝 Walkthrough

Adds Intake mapping helpers that convert evaluator trials and scores into ATIF ingest payloads and evaluator result rows, plus tests for session IDs, experiment context, value coercion, and import restrictions.

Changes

Intake mapping boundary

| Layer / File(s) | Summary |
| --- | --- |
| Module contract and trial ingest (plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py) | Defines ATIF constants, session and experiment-context helpers, and the trial-to-ATIF ingest payload builder. |
| Score result coercion (plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py) | Maps task scores into evaluator result rows and classifies metric values into BOOLEAN, NUMERIC, or TEXT fields. |
| Mapping unit tests (plugins/nemo-evaluator/tests/intake/test_mapping.py) | Covers session IDs, experiment context, ATIF ingest payloads, and evaluator result coercion and diagnostics. |
| Import hygiene guardrail (plugins/nemo-evaluator/tests/intake/test_import_hygiene.py) | Scans the intake package for forbidden Intake service, transport, and HTTPX imports and fails on any matches. |

Sequence Diagram(s)

sequenceDiagram
  participant AgentEvalTrial
  participant trial_to_atif_ingest
  participant run_task_to_experiment_context
  participant AtifCreateParams
  participant AgentEvalTaskScore
  participant score_to_evaluator_results
  participant _coerce_metric_value
  participant EvaluatorResultCreateParams

  AgentEvalTrial->>trial_to_atif_ingest: build ATIF ingest payload
  trial_to_atif_ingest->>run_task_to_experiment_context: derive experiment_context
  run_task_to_experiment_context-->>trial_to_atif_ingest: experiment_id, test_case_id
  trial_to_atif_ingest->>AtifCreateParams: assemble schema, agent, step, metrics

  AgentEvalTaskScore->>score_to_evaluator_results: map score outputs
  loop each output
    score_to_evaluator_results->>_coerce_metric_value: unwrap and classify value
    _coerce_metric_value-->>score_to_evaluator_results: data_type, value, string_value
    score_to_evaluator_results->>EvaluatorResultCreateParams: emit result row
  end

Suggested labels

feat

Suggested reviewers

  • ngoncharenko
  • asutermo
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.73% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding the Trial→Intake boundary mapping module. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions

Contributor

| Suite | Lines Covered | Line Rate | Branch Rate |
| --- | --- | --- | --- |
| Unit Tests | 20908/27474 | 76.1% | 61.2% |
| Integration Tests | 12109/26243 | 46.1% | 19.5% |
