fix(sdk): align script grader results #1659
Merged
Deploying agentv with Cloudflare Pages

| Latest commit: | e8ddf96 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://2886af14.agentv.pages.dev |
| Branch Preview URL: | https://grading-sdk-script-api.agentv.pages.dev |
Force-pushed from 1b2cf44 to e8ddf96.
Summary
SDK and script-grader authors can now return the finalized `pass`, `score`, `reason`, and optional `checks[]` vocabulary directly. The SDK builders, Zod schemas, Vitest/workspace adapters, Python helper example, generated assertion scaffolds, and script-grader docs all present that vocabulary as the public surface.

Core script-grader parsing now accepts the finalized JSON protocol, derives aggregate score/pass from checks when needed, and carries `reason`/`checks` through the internal evaluator result. It still bridges checks into the current internal `assertion_results` shape so the artifact writer can be replaced separately by av-kfik.28.6.

Publish-surface considerations: this intentionally updates the experimental `@agentv/sdk` result surface instead of documenting stale `assertions[]`/`passed` aliases as supported public API. The deprecated `CodeGraderResult` type name remains as an alias to the new result schema, but the wire/public shape is the finalized vocabulary.

Related: av-kfik.28.4
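As a rough illustration of the "derives aggregate score/pass from checks when needed" behavior, the sketch below shows one way a result could be finalized. Only the field names `pass`, `score`, `reason`, and `checks` come from this description. The `name` field on a check, the `finalizeResult` helper, and the aggregation rule (mean of check scores, pass only if every check passes) are assumptions made for illustration and may differ from what the core parser actually does.

```typescript
// Hypothetical shapes: only pass, score, reason, and checks are named in the PR.
interface Check {
  name: string; // assumed label for the check
  pass: boolean;
  score?: number; // assumed optional, in [0, 1]
  reason?: string;
}

interface RawGraderResult {
  pass?: boolean;
  score?: number;
  reason?: string;
  checks?: Check[];
}

interface FinalGraderResult {
  pass: boolean;
  score: number;
  reason: string | undefined;
  checks: Check[];
}

// Assumed rule: explicit pass/score win; otherwise score is the mean of the
// check scores (a check without a score counts as 1 if it passed, else 0),
// and pass requires every check to pass.
function finalizeResult(raw: RawGraderResult): FinalGraderResult {
  const checks = raw.checks ?? [];
  const checkScores = checks.map((c) => c.score ?? (c.pass ? 1 : 0));
  const derivedScore =
    checkScores.length > 0
      ? checkScores.reduce((a, b) => a + b, 0) / checkScores.length
      : 0;
  const score = raw.score ?? derivedScore;
  const pass =
    raw.pass ?? (checks.length > 0 ? checks.every((c) => c.pass) : score >= 1);
  return { pass, score, reason: raw.reason, checks };
}

// A grader that returns only checks gets an aggregate derived for it.
const derived = finalizeResult({
  reason: "one of two checks failed",
  checks: [
    { name: "has-title", pass: true },
    { name: "has-summary", pass: false },
  ],
});
console.log(derived.pass, derived.score);
```

Under these assumptions, a grader that sets `pass` and `score` itself keeps those values, and its checks are carried through unchanged.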
Validation
- `bun run build`
- `bun test packages/sdk/test/define-script-grader.test.ts packages/sdk/test/workspace-grader.test.ts packages/sdk/test/vitest-workspace-grader.test.ts`
- `bun test packages/core/test/evaluation/graders/script-grader-plain-text.test.ts packages/core/test/evaluation/script-grader-file-backed.test.ts packages/core/test/evaluation/script-grader-multimodal.test.ts`
- `bun test packages/core/test/evaluation/graders.test.ts packages/core/test/evaluation/execution-metrics.test.ts`
- `uv run pytest` in `examples/features/sdk-python`
- `bun run lint`
- `git diff --check`
- `bun --env-file=.env apps/cli/src/cli.ts eval run examples/features/script-grader-sdk/evals/suite.yaml --target local_cli` exercised the SDK script grader path; the script grader's per-grader score was 1.0 with all returned checks passing. The overall eval scored 50% because the separate LLM rubric target did not resolve from the example-local target config.
- With `--grader-target openai`, the LLM grader reached provider execution but failed after retries with `pi-ai call failed: Connection error`. Script-grader API dogfood is covered; live LLM rubric dogfood is blocked by provider connectivity in this worktree.