feat(evaluator): add Trial→Intake boundary mapping module (D8) #443
Open
SandyChapman wants to merge 1 commit into
Adds plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py: the single pure layer that translates Evaluator vocabulary (AgentEvalTrial, AgentEvalTaskScore, MetricOutput) into the platform SDK's typed Intake request params, so the D3/D4/D5 write-adapters share one source of request shapes.

- trial_to_atif_ingest -> AtifCreateParams (minimal single-step trajectory until D2 trace normalization; defaults agent.version per design §3.9 #6).
- score_to_evaluator_results -> list[EvaluatorResultCreateParams], one row per MetricOutput, name='{metric_type}.{output}', span_id supplied by the caller (resolved post-ingest; the adapter owns that orchestration).
- run_task_to_experiment_context -> ExperimentContextParam (lean {experiment_id, test_case_id}).

Returns the generated nemo-platform-sdk *CreateParams TypedDicts (runtime dicts, statically checked against the real schema) rather than hand-shaped dicts; imports the SDK client types, never the Intake service (nmp.intake.*).

CATEGORICAL coercion is intentionally deferred (strings -> TEXT) until a real signal exists.

Includes unit tests for all coercions + the .root unwrap and an import-hygiene guardrail.

Refs: AALGO-289
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Sandy Chapman <schapman@nvidia.com>
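The claim that the returned TypedDicts are plain dicts at runtime can be illustrated with a minimal sketch. The ExperimentContext class and the create function below are hypothetical stand-ins, not the real nemo-platform-sdk types or client; only the field names experiment_id and test_case_id come from the description above.

```python
from typing import TypedDict


class ExperimentContext(TypedDict):
    """Hypothetical stand-in for the SDK's ExperimentContextParam."""

    experiment_id: str
    test_case_id: str


def build_experiment_context(experiment_id: str, test_case_id: str) -> ExperimentContext:
    # A static type checker validates these keys and value types
    # against the TypedDict definition above.
    return {"experiment_id": experiment_id, "test_case_id": test_case_id}


def create(*, experiment_id: str, test_case_id: str) -> str:
    """Hypothetical stand-in for a client method that takes keyword arguments."""
    return f"{experiment_id}/{test_case_id}"


context = build_experiment_context("exp-1", "case-7")
# At runtime the TypedDict is an ordinary dict, so it can be splatted directly.
result = create(**context)
```

Because the value is an ordinary dict, no conversion step is needed between the mapping layer and the call site; the type information exists only for the checker.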
No actionable comments were generated in the recent review. 🎉
Recent review info: configuration path .coderabbit.yaml, review profile CHILL, plan Enterprise; 3 files selected for processing.
📝 Walkthrough
Adds Intake mapping helpers that convert evaluator trials and scores into ATIF ingest payloads and evaluator result rows, plus tests for session IDs, experiment context, value coercion, and import restrictions.
Changes
Intake mapping boundary
Sequence Diagram(s)
sequenceDiagram
participant AgentEvalTrial
participant trial_to_atif_ingest
participant run_task_to_experiment_context
participant AtifCreateParams
participant AgentEvalTaskScore
participant score_to_evaluator_results
participant _coerce_metric_value
participant EvaluatorResultCreateParams
AgentEvalTrial->>trial_to_atif_ingest: build ATIF ingest payload
trial_to_atif_ingest->>run_task_to_experiment_context: derive experiment_context
run_task_to_experiment_context-->>trial_to_atif_ingest: experiment_id, test_case_id
trial_to_atif_ingest->>AtifCreateParams: assemble schema, agent, step, metrics
AgentEvalTaskScore->>score_to_evaluator_results: map score outputs
loop each output
score_to_evaluator_results->>_coerce_metric_value: unwrap and classify value
_coerce_metric_value-->>score_to_evaluator_results: data_type, value, string_value
score_to_evaluator_results->>EvaluatorResultCreateParams: emit result row
end
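The score-mapping loop in the diagram above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's actual code: ResultRow and Root are hypothetical stand-ins for the SDK and evaluator types, the outputs dict is an assumed simplification of AgentEvalTaskScore, and the data_type names BOOLEAN and NUMERIC are guesses (only TEXT and the deferred CATEGORICAL appear in the PR text).

```python
from dataclasses import dataclass
from typing import Any, Optional, TypedDict


class ResultRow(TypedDict, total=False):
    """Hypothetical stand-in for EvaluatorResultCreateParams."""

    session_id: str
    span_id: str
    name: str
    data_type: str
    value: Optional[float]
    string_value: Optional[str]


@dataclass
class Root:
    """Hypothetical wrapper that exposes its payload as .root."""

    root: Any


def coerce_metric_value(raw: Any) -> tuple[str, Optional[float], Optional[str]]:
    """Return (data_type, value, string_value) for one metric output value."""
    # Unwrap a .root-style wrapper if present; plain values pass through.
    inner = getattr(raw, "root", raw)
    # bool must be tested before int/float because bool subclasses int.
    if isinstance(inner, bool):
        return "BOOLEAN", float(inner), None  # assumed type name
    if isinstance(inner, (int, float)):
        return "NUMERIC", float(inner), None  # assumed type name
    # CATEGORICAL is deferred in the PR, so every string maps to TEXT.
    return "TEXT", None, str(inner)


def score_to_rows(
    metric_type: str, outputs: dict[str, Any], *, session_id: str, span_id: str
) -> list[ResultRow]:
    """Emit one row per output, named '{metric_type}.{output}'."""
    rows: list[ResultRow] = []
    for output_name, raw in outputs.items():
        data_type, value, string_value = coerce_metric_value(raw)
        rows.append(
            {
                "session_id": session_id,
                "span_id": span_id,  # supplied by the caller, resolved post-ingest
                "name": f"{metric_type}.{output_name}",
                "data_type": data_type,
                "value": value,
                "string_value": string_value,
            }
        )
    return rows


rows = score_to_rows(
    "exact_match",
    {"score": Root(0.5), "passed": True, "label": "partial"},
    session_id="sess-1",
    span_id="span-9",
)
```

Keeping span_id as a keyword argument mirrors the PR's point that the function stays pure: whoever calls it is responsible for resolving the span first.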
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
What
Adds plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py, the single pure layer (D8, AALGO-289) that translates Evaluator vocabulary into the platform SDK's typed Intake request params. D3/D4/D5 obtain their request shapes and field names only from here, so a later glossary rename is a one-file change.

Three pure functions:

- trial_to_atif_ingest(trial, ...) -> AtifCreateParams: minimal single-step trajectory from trial.output until D2 trace normalization; defaults agent.version (design §3.9 #6).
- score_to_evaluator_results(score, *, session_id, span_id) -> list[EvaluatorResultCreateParams]: one row per MetricOutput, name="{metric_type}.{output}". span_id is a caller-supplied parameter because it's server-assigned and only knowable after the ATIF ingest + span lookup. The orchestration (loop trials → POST atif → resolve span → coerce scores) belongs to the D3/D9 adapter, not this pure module.
- run_task_to_experiment_context(trial, *, experiment_id) -> ExperimentContextParam: lean {experiment_id, test_case_id}.

Design notes

- Returns the generated nemo-platform-sdk *CreateParams TypedDicts, not hand-shaped dicts. At runtime they're plain dicts the adapter splats into client.intake.ingest.atif.create(**body); statically, ty checks our field names / literals / nested shapes against the real generated schema, so an API change surfaces as a type error instead of drifting silently. Imports the SDK client types (nemo_platform.types.intake.*, already a plugin dep), never the Intake service (nmp.intake.*).
- CATEGORICAL coercion is intentionally deferred (… str/Label), so everything string-valued maps to TEXT until a real signal exists.

Tests

- All data_type coercions + the MetricOutput.value.root unwrap.
- trial_to_atif_ingest shape, version defaulting, missing-output, final_metrics.
- score_to_evaluator_results naming, one-row-per-output, comment-from-diagnostic.
- Import-hygiene guardrail: nothing under nemo_evaluator.intake imports the Intake service.

Verification

ruff (style + format), copyright headers, no-nmp_common-in-plugins guard, ty, and 21 unit tests all green.

Refs: AALGO-289. Informs D3/D4/D5/D9.
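An import-hygiene guardrail of the kind mentioned here could be written as a static scan over module source. The sketch below is one possible approach, not necessarily how the PR's test works: the forbidden_imports helper and the submodule names in the sample strings (nmp.intake.service, models) are hypothetical, and whether AtifCreateParams is importable directly from nemo_platform.types.intake is an assumption.

```python
import ast

# Prefix taken from the PR text; anything at or under it counts as forbidden.
FORBIDDEN_PREFIX = "nmp.intake"


def forbidden_imports(source: str, prefix: str = FORBIDDEN_PREFIX) -> list[str]:
    """Return imported module names in `source` that equal or sit under `prefix`."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Only absolute 'from X import Y' statements are considered.
            names = [node.module]
        else:
            continue
        found.extend(n for n in names if n == prefix or n.startswith(prefix + "."))
    return found


# Sample sources; the code is only parsed, never executed or imported.
clean = "from nemo_platform.types.intake import AtifCreateParams\n"
dirty = "import nmp.intake.service\nfrom nmp.intake import models\n"

clean_hits = forbidden_imports(clean)
dirty_hits = forbidden_imports(dirty)
```

In a real test, the same helper would presumably be run over every .py file in the nemo_evaluator.intake package, failing if any file returns a non-empty list. Parsing with ast avoids importing the modules, so the check does not need the Intake service installed.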
🤖 Generated with Claude Code